SURROGATE LOSSES IN PASSIVE AND ACTIVE LEARNING 



By Steve Hanneke and Liu Yang 
Carnegie Mellon University 

Active learning is a type of sequential design for supervised machine 
learning, in which the learning algorithm sequentially requests the labels of 
selected instances from a large pool of unlabeled data points. The objective is 
to produce a classifier of relatively low risk, as measured under the 0-1 loss, 
ideally using fewer label requests than the number of random labeled data 
points sufficient to achieve the same. This work investigates the potential 
uses of surrogate loss functions in the context of active learning. Specifically,
it presents an active learning algorithm based on an arbitrary
classification-calibrated surrogate loss function, along with an analysis of the number of
label requests sufficient for the classifier returned by the algorithm to achieve
a given risk under the 0-1 loss. Interestingly, these results cannot be obtained
by simply optimizing the surrogate risk via active learning to an extent
sufficient to provide a guarantee on the 0-1 loss, as is common practice in the
analysis of surrogate losses for passive learning. Some of the results have
additional implications for the use of surrogate losses in passive learning. 



AMS 2000 subject classifications: Primary 62L05, 68Q32, 62H30, 68T05; secondary 68T10,
68Q10, 68Q25, 68W40, 62G99

Keywords and phrases: active learning, sequential design, selective sampling, statistical learning
theory, surrogate loss functions, classification

1. Introduction. In supervised machine learning, we are tasked with learning
a classifier whose probability of making a mistake (i.e., error rate) is small. The
study of when it is possible to learn an accurate classifier via a computationally
efficient algorithm, and how to go about doing so, is a subtle and difficult topic,
owing largely to nonconvexity of the loss function: namely, the 0-1 loss. While
there is certainly an active literature on developing computationally efficient
methods that succeed at this task, even under various noise conditions, it seems fair
to say that at present, many of these advances have not yet reached the level of
robustness, efficiency, and simplicity required for most applications. In the
meantime, practitioners have turned to various heuristics in the design of practical
learning methods, in attempts to circumvent these tough computational problems. One
of the most common such heuristics is the use of a convex surrogate loss function
in place of the 0-1 loss in various optimizations performed by the learning method.
The convexity of the surrogate loss allows these optimizations to be performed
efficiently, so that the methods can be applied within a reasonable execution time, even
with only modest computational resources. Although classifiers arrived at in this
way are not always guaranteed to be good classifiers when performance is
measured under the 0-1 loss, in practice this heuristic has often proven quite effective.
In light of this fact, most modern learning methods either explicitly make use of
a surrogate loss in the formulation of optimization problems (e.g., SVM), or
implicitly optimize a surrogate loss via iterative descent (e.g., AdaBoost). Indeed, the
choice of a surrogate loss is often as fundamental a part of the process of
approaching a learning problem as the choice of hypothesis class or learning bias. Thus it
seems essential that we come to some understanding of how best to make use of
surrogate losses in the design of learning methods, so that in the favorable scenario
that this heuristic actually does work, we have methods taking full advantage of it.

In this work, we are primarily interested in how best to use surrogate losses in 
the context of active learning, which is a type of sequential design in which the 
learning algorithm is presented with a large pool of unlabeled data points (i.e., 
only the covariates are observable), and can sequentially request to observe the 
labels (response variables) of individual instances from the pool. The objective in 
active learning is to produce a classifier of low error rate while accessing a smaller 
number of labels than would be required for a method based on random labeled 
data points (i.e., passive learning) to achieve the same. We take as our starting 
point that we have already committed to use a given surrogate loss, and we restrict 
our attention to just those scenarios in which this heuristic actually does work. We 
are then interested in how best to make use of the surrogate loss toward the goal of 
producing a classifier with relatively small error rate. To be clear, we focus on the
case where the minimizer of the surrogate risk also minimizes the error rate, and is 
contained in our function class. 

We construct an active learning strategy based on optimizing the empirical
surrogate risk over increasingly focused subsets of the instance space, and derive
bounds on the number of label requests the method requires to achieve a given
error rate. Interestingly, we find that the basic approach of optimizing the surrogate
risk via active learning to a sufficient extent to guarantee small error rate generally
does not lead to results that are as strong. In fact, the method our results apply to
typically does not optimize the surrogate risk (even in the limit). The insight leading to
this algorithm is that, if we are truly only interested in achieving low 0-1 loss, then 
once we have identified the sign of the optimal function at a given point, we need 
not optimize the value of the function at that point any further, and can therefore 
focus the label requests elsewhere. As a byproduct of this analysis, we find this 
insight has implications for the use of certain surrogate losses in passive learning 
as well, though to a lesser extent. 

Most of the mathematical tools used in this analysis are inspired by recently
developed techniques for the study of active learning [18, 19, 25], in conjunction
with the results of Bartlett, Jordan, and McAuliffe [6] bounding the excess
error rate in terms of the excess surrogate risk, and the works of Koltchinskii [23]
and Bartlett, Bousquet, and Mendelson [7] on localized Rademacher complexity
bounds.

1.1. Related Work. There are many previous works on the topic of surrogate 
losses in the context of passive learning. Perhaps the most relevant to our results 
below are the work of Bartlett, Jordan, and McAuliffe [6] and the related work of 
Zhang [38]. These develop a general theory for converting results on excess risk 
under the surrogate loss into results on excess risk under the 0-1 loss. Below, we 
describe the conclusions of that work in detail, and we build on many of the basic 
definitions and insights pioneered in these works. 

Another related line of research, initiated by Audibert and Tsybakov [2],
studies "plug-in rules," which make use of regression estimates obtained by
optimizing a surrogate loss, and are then rounded to $\{-1,+1\}$ values to obtain
classifiers. They prove results under smoothness assumptions on the actual regression
function, which (remarkably) are often better than the known results for methods
that directly optimize the 0-1 loss. Under similar conditions, Minsker [28] studies
an analogous active learning method, which again makes use of a surrogate loss,
and obtains improvements in label complexity compared to the passive learning
method of Audibert and Tsybakov [2]; again, the results for this method based on
a surrogate loss are actually better than those derived from existing active
learning methods designed to directly optimize the 0-1 loss. The works of Audibert and
Tsybakov [2] and Minsker [28] raise interesting questions about whether the
general analyses of methods that optimize the 0-1 loss remain tight under complexity
assumptions on the regression function, and potentially also about the design of
optimal methods for classification when assumptions are phrased in terms of the
regression function.

In the present work, we focus our attention on scenarios where the main purpose 
of using the surrogate loss is to ease the computational problems associated with 
minimizing an empirical risk, so that our statistical results are typically strongest 
when the surrogate loss is the 0-1 loss itself. Thus, in the specific scenarios studied 
by Minsker [28], our results are generally not optimal; rather, the main strength 
of our analysis lies in its generality. In this sense, our results are more closely 
related to those of Bartlett, Jordan, and McAuliffe [6] and Zhang [38] than to those 
of Audibert and Tsybakov [2] and Minsker [28]. That said, we note that several 
important elements of the design and analysis of the active learning method below 
are already present to some extent in the work of Minsker [28]. 

There are several interesting works on active learning methods that optimize a
general loss function. Beygelzimer, Dasgupta, and Langford [8] and Koltchinskii
[25] have both proposed active learning methods, and analyzed the number of
label requests the methods make before achieving a given excess risk for that loss
function. The former method is based on importance weighted sampling, while the
latter makes clear an interesting connection to local Rademacher complexities. One
natural idea for approaching the problem of active learning with a surrogate loss is
to run one of these methods with the surrogate loss. The results of Bartlett, Jordan,
and McAuliffe [6] allow us to determine a sufficiently small value $\gamma$ such that any
function with excess surrogate risk at most $\gamma$ has excess error rate at most $\varepsilon$. Thus,
by evaluating the established bounds on the number of label requests sufficient for
these active learning methods to achieve excess surrogate risk $\gamma$, we immediately
have a result on the number of label requests sufficient for them to achieve excess
error rate $\varepsilon$. This is a common strategy for constructing and analyzing passive
learning algorithms that make use of a surrogate loss. However, as we discuss below,
this strategy does not generally lead to the best behavior in active learning, and
often will not be much better than simply using a related passive learning method.
Instead, we propose a new method that typically does not optimize the surrogate
risk, but makes use of it in a different way so as to achieve stronger results when
performance is measured under the 0-1 loss.

2. Definitions. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space, where $\mathcal{X}$ is called the
instance space; for convenience, we suppose this is a standard Borel space. Let
$\mathcal{Y} = \{-1,+1\}$, and equip the space $\mathcal{X} \times \mathcal{Y}$ with its product $\sigma$-algebra:
$\mathcal{B} = \mathcal{B}_{\mathcal{X}} \otimes 2^{\mathcal{Y}}$. Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$, let $\mathcal{F}^*$ denote the set of all
measurable functions $g : \mathcal{X} \to \bar{\mathbb{R}}$, and let $\mathcal{F} \subseteq \mathcal{F}^*$, where $\mathcal{F}$ is called the
function class. Throughout, we fix a distribution $\mathcal{P}_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, and we denote
by $\mathcal{P}$ the marginal distribution of $\mathcal{P}_{XY}$ over $\mathcal{X}$. In the analysis below, we make the
usual simplifying assumption that the events and functions in the definitions and
proofs are indeed measurable. In most cases, this holds under simple conditions on
$\mathcal{F}$ and $\mathcal{P}_{XY}$ [see e.g., 34]; when this is not the case, we may turn to outer
probabilities. However, we will not discuss these technical issues further.

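For any $h \in \mathcal{F}^*$, and any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, denote the error rate by
$\mathrm{er}(h;P) = P((x,y) : \mathrm{sign}(h(x)) \neq y)$; when $P = \mathcal{P}_{XY}$, we abbreviate this as
$\mathrm{er}(h) = \mathrm{er}(h;\mathcal{P}_{XY})$. Also, let $\eta(X;P)$ be a version of $\mathbb{P}(Y = 1 | X)$, for
$(X,Y) \sim P$; when $P = \mathcal{P}_{XY}$, abbreviate this as $\eta(X) = \eta(X;\mathcal{P}_{XY})$. In particular,
note that $\mathrm{er}(h;P)$ is minimized at any $h$ with $\mathrm{sign}(h(x)) = \mathrm{sign}(\eta(x;P) - 1/2)$ for
all $x \in \mathcal{X}$. In this work, we will also be interested in certain conditional
distributions and modifications of functions, specified as follows. For any measurable
$\mathcal{U} \subseteq \mathcal{X}$ with $\mathcal{P}(\mathcal{U}) > 0$, define the probability measure
$\mathcal{P}_{\mathcal{U}}(\cdot) = \mathcal{P}_{XY}(\cdot \,|\, \mathcal{U} \times \mathcal{Y}) = \mathcal{P}_{XY}(\cdot \cap \mathcal{U} \times \mathcal{Y}) / \mathcal{P}(\mathcal{U})$:
that is, $\mathcal{P}_{\mathcal{U}}$ is the conditional distribution of $(X,Y) \sim \mathcal{P}_{XY}$ given
that $X \in \mathcal{U}$. Also, for any $h, g \in \mathcal{F}^*$, define the spliced function
$h_{\mathcal{U},g}(x) = h(x)\mathbb{1}_{\mathcal{U}}(x) + g(x)\mathbb{1}_{\mathcal{X} \setminus \mathcal{U}}(x)$. For a set $\mathcal{H} \subseteq \mathcal{F}^*$, denote
$\mathcal{H}_{\mathcal{U},g} = \{h_{\mathcal{U},g} : h \in \mathcal{H}\}$.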






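For any $\mathcal{H} \subseteq \mathcal{F}^*$, define the region of sign-disagreement
$\mathrm{DIS}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } \mathrm{sign}(h(x)) \neq \mathrm{sign}(g(x))\}$,
and the region of value-disagreement
$\mathrm{DISF}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } h(x) \neq g(x)\}$,
and denote by $\overline{\mathrm{DIS}}(\mathcal{H}) = \mathrm{DIS}(\mathcal{H}) \times \mathcal{Y}$ and $\overline{\mathrm{DISF}}(\mathcal{H}) = \mathrm{DISF}(\mathcal{H}) \times \mathcal{Y}$.
Additionally, we denote by
$[\mathcal{H}] = \{f \in \mathcal{F}^* : \forall x \in \mathcal{X}, \inf_{h \in \mathcal{H}} h(x) \le f(x) \le \sup_{h \in \mathcal{H}} h(x)\}$
the minimal bracket set containing $\mathcal{H}$.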
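
To make the two disagreement regions concrete, the following minimal Python sketch evaluates them on a finite pool of points rather than on the whole instance space. The threshold class `make_threshold`, the integer pool, and the convention sign(0) = +1 are assumptions made for this illustration only; none of them is fixed by the definitions above.

```python
def sign(v):
    # Convention assumed for this sketch: sign(0) = +1.
    return 1 if v >= 0 else -1


def dis(H, pool):
    """Pool points where two functions in H disagree in sign."""
    return [x for x in pool if len({sign(h(x)) for h in H}) > 1]


def disf(H, pool):
    """Pool points where two functions in H disagree in value."""
    return [x for x in pool if len({h(x) for h in H}) > 1]


def make_threshold(t):
    # Hypothetical function class for illustration: h_t(x) = x - t.
    return lambda x: x - t


pool = list(range(10))
H = [make_threshold(3), make_threshold(6)]

sign_region = dis(H, pool)    # the points x with 3 <= x < 6
value_region = disf(H, pool)  # every point: the two functions never agree in value
```

In this toy case the sign-disagreement region is much smaller than the value-disagreement region, since the functions differ in value everywhere but in sign only between the two thresholds.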

Our interest here is learning from data, so let $\mathcal{Z} = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$
denote a sequence of independent $\mathcal{P}_{XY}$-distributed random variables, referred to
as the labeled data sequence, while $\{X_1, X_2, \ldots\}$ is referred to as the unlabeled
data sequence. For $m \in \mathbb{N}$, we also denote $\mathcal{Z}_m = \{(X_1, Y_1), \ldots, (X_m, Y_m)\}$.
Throughout, we will let $\delta \in (0, 1/4)$ denote an arbitrary confidence parameter,
which will be referenced in the methods and theorem statements.

The active learning protocol is defined as follows. An active learning algorithm
is initially permitted access to the sequence $X_1, X_2, \ldots$ of unlabeled data. It may
then select an index $i_1 \in \mathbb{N}$ and request to observe $Y_{i_1}$; after observing $Y_{i_1}$, it may
select another index $i_2 \in \mathbb{N}$, request to observe $Y_{i_2}$, and so on. After a number
of such label requests not exceeding some specified budget $n$, the algorithm halts
and returns a function $\hat{h} \in \mathcal{F}^*$. Formally, this protocol specifies a type of
mapping that maps the random variable $\mathcal{Z}$ to a function $\hat{h}$, where $\hat{h}$ is conditionally
independent of $\mathcal{Z}$ given $X_1, X_2, \ldots$ and $(i_1, Y_{i_1}), (i_2, Y_{i_2}), \ldots, (i_n, Y_{i_n})$, where
each $i_k$ is conditionally independent of $\mathcal{Z}$ and $i_{k+1}, \ldots, i_n$ given $X_1, X_2, \ldots$ and
$(i_1, Y_{i_1}), \ldots, (i_{k-1}, Y_{i_{k-1}})$.
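
The protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LabelOracle` class and the `learn_threshold` learner are hypothetical names introduced here, and the learner (a binary search for a one-dimensional threshold on noise-free labels) is a toy stand-in, not the method analyzed in this work. What the sketch does show is the information constraint: each requested index depends only on the unlabeled pool and on the labels already observed, and the number of requests never exceeds the budget.

```python
import random


class LabelOracle:
    """Holds the hidden labels and counts label requests against a budget n."""

    def __init__(self, labels, budget):
        self._labels = list(labels)  # hidden from the learner
        self.budget = budget
        self.requests = []           # history of (index, label) pairs

    def request(self, i):
        """Reveal the label at index i, charging one unit of budget."""
        if len(self.requests) >= self.budget:
            raise RuntimeError("label budget exhausted")
        y = self._labels[i]
        self.requests.append((i, y))
        return y


def learn_threshold(xs, oracle):
    """Toy active learner: binary search for the first +1 point in sorted order."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    lo, hi = 0, len(order)
    while lo < hi and len(oracle.requests) < oracle.budget:
        mid = (lo + hi) // 2
        if oracle.request(order[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    t = xs[order[lo]] if lo < len(order) else float("inf")
    return lambda x: 1.0 if x >= t else -1.0


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [1 if x >= 0.3 else -1 for x in xs]   # noise-free labels (an assumption)
oracle = LabelOracle(ys, budget=10)
h_hat = learn_threshold(xs, oracle)
pool_errors = sum(1 for x, y in zip(xs, ys) if h_hat(x) != y)
```

On this noise-free pool of 200 points the binary search needs at most 8 label requests, whereas labeling the pool in random order would typically need far more to locate the threshold as precisely.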

2.1. Surrogate Loss Functions for Classification. Throughout, we let
$\ell : \bar{\mathbb{R}} \to [0, \infty]$ denote an arbitrary surrogate loss function; we will primarily be interested
in functions $\ell$ that satisfy certain conditions discussed below. To simplify some
statements below, it will be convenient to suppose $z \in \mathbb{R} \Rightarrow \ell(z) < \infty$. For any
$g \in \mathcal{F}^*$ and distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, let $\mathrm{R}_\ell(g; P) = \mathbb{E}[\ell(g(X)Y)]$, where
$(X, Y) \sim P$; in the case $P = \mathcal{P}_{XY}$, abbreviate $\mathrm{R}_\ell(g) = \mathrm{R}_\ell(g; \mathcal{P}_{XY})$. Also define
$\bar{\ell} = 1 \vee \sup_{x \in \mathcal{X}} \sup_{h \in \mathcal{F}} \max_{y \in \{-1,+1\}} \ell(y h(x))$; we will generally suppose
$\bar{\ell} < \infty$. In practice, this is more often a constraint on $\mathcal{F}$ than on $\ell$; that is, we could have
$\ell$ unbounded, but due to some normalization of the functions $h \in \mathcal{F}$, $\ell$ is bounded
on the corresponding set of values.

Throughout this work, we will be interested in loss functions $\ell$ whose point-wise
minimizer necessarily also optimizes the 0-1 loss. This property was nicely
characterized by Bartlett, Jordan, and McAuliffe [6] as follows. For $\eta_0 \in [0, 1]$, define
$\ell^*(\eta_0) = \inf_{z \in \bar{\mathbb{R}}} (\eta_0 \ell(z) + (1 - \eta_0)\ell(-z))$, and
$\ell^*_-(\eta_0) = \inf_{z \in \bar{\mathbb{R}} : z(2\eta_0 - 1) \le 0} (\eta_0 \ell(z) + (1 - \eta_0)\ell(-z))$.

Definition 1. The loss $\ell$ is classification-calibrated if $\forall \eta_0 \in [0, 1] \setminus \{1/2\}$,
$\ell^*_-(\eta_0) > \ell^*(\eta_0)$. $\diamond$






In our context, for $X \sim \mathcal{P}$, $\ell^*(\eta(X))$ represents the minimum value of the
conditional $\ell$-risk at $X$, so that $\mathbb{E}[\ell^*(\eta(X))] = \inf_{h \in \mathcal{F}^*} \mathrm{R}_\ell(h)$, while $\ell^*_-(\eta(X))$
represents the minimum conditional $\ell$-risk at $X$, subject to having a sub-optimal
conditional error rate at $X$: i.e., $\mathrm{sign}(h(X)) \neq \mathrm{sign}(\eta(X) - 1/2)$. Thus, being
classification-calibrated implies the minimizer of the conditional $\ell$-risk at $X$
necessarily has the same sign as the minimizer of the conditional error rate at $X$.
Since we are only interested here in using $\ell$ as a reasonable surrogate for the 0-1
loss, throughout the work below we suppose $\ell$ is classification-calibrated.
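
The calibration condition can be probed numerically. The sketch below approximates the two infima on a finite grid of $z$ values and compares them at a handful of $\eta_0$ values. The grid range, the list `ETAS`, and the deliberately mis-oriented loss `wrong_way` are assumptions introduced for this illustration; a grid check of this kind can suggest, but not prove, that a loss is classification-calibrated.

```python
import math

# Candidate z values in [-5, 5]; a finite stand-in for the infimum over all z.
GRID = [k / 100.0 for k in range(-500, 501)]


def cond_risk(loss, eta, z):
    """Conditional surrogate risk eta * loss(z) + (1 - eta) * loss(-z)."""
    return eta * loss(z) + (1.0 - eta) * loss(-z)


def ell_star(loss, eta):
    """Grid approximation of the unconstrained minimum conditional risk."""
    return min(cond_risk(loss, eta, z) for z in GRID)


def ell_star_minus(loss, eta):
    """Grid approximation of the minimum over z with the sub-optimal sign."""
    return min(cond_risk(loss, eta, z) for z in GRID if z * (2 * eta - 1) <= 0)


def looks_calibrated(loss, etas, tol=1e-9):
    """True if the constrained minimum is strictly larger at every tested eta."""
    return all(ell_star_minus(loss, e) > ell_star(loss, e) + tol for e in etas)


hinge = lambda z: max(1.0 - z, 0.0)
exponential = lambda z: math.exp(-z)
wrong_way = lambda z: max(1.0 + z, 0.0)  # hypothetical loss that rewards the wrong sign

ETAS = [0.1, 0.3, 0.45, 0.55, 0.7, 0.9]
```

With these definitions, `looks_calibrated(hinge, ETAS)` and `looks_calibrated(exponential, ETAS)` hold, while `wrong_way` fails the check because its unconstrained minimizer already has the sub-optimal sign.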

Though not strictly necessary for our results below, it will be convenient for us
to suppose that, for all $\eta_0 \in [0, 1]$, this infimum value $\ell^*(\eta_0)$ is actually obtained as
$\eta_0 \ell(z^*(\eta_0)) + (1 - \eta_0)\ell(-z^*(\eta_0))$ for some $z^*(\eta_0) \in \bar{\mathbb{R}}$ (not necessarily unique).
For instance, this is the case for any nonincreasing right-continuous $\ell$, or
continuous and convex $\ell$, which include most of the cases we are interested in using as
surrogate losses anyway. The proofs can be modified in a natural way to handle the
general case, simply substituting any $z$ with conditional risk sufficiently close to
the minimum value. For any distribution $P$, denote $f^*_P(x) = z^*(\eta(x; P))$ for all
$x \in \mathcal{X}$. In particular, note that $f^*_P$ obtains $\mathrm{R}_\ell(f^*_P; P) = \inf_{g \in \mathcal{F}^*} \mathrm{R}_\ell(g; P)$. When
$P = \mathcal{P}_{XY}$, we abbreviate this as $f^* = f^*_{\mathcal{P}_{XY}}$. Furthermore, if $\ell$ is
classification-calibrated, then $\mathrm{sign}(f^*_P(x)) = \mathrm{sign}(\eta(x; P) - 1/2)$ for all $x \in \mathcal{X}$ with
$\eta(x; P) \neq 1/2$, and hence $\mathrm{er}(f^*_P; P) = \inf_{h \in \mathcal{F}^*} \mathrm{er}(h; P)$ as well.

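For any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, and any $h, g \in \mathcal{F}^*$, define the loss distance
$\mathrm{D}_\ell(h, g; P) = \sqrt{\mathbb{E}\left[(\ell(h(X)Y) - \ell(g(X)Y))^2\right]}$, where $(X, Y) \sim P$. Also define
the loss diameter of a class $\mathcal{H} \subseteq \mathcal{F}^*$ as $\mathrm{D}_\ell(\mathcal{H}; P) = \sup_{h, g \in \mathcal{H}} \mathrm{D}_\ell(h, g; P)$, and the
$\ell$-risk $\varepsilon$-minimal set of $\mathcal{H}$ as
$\mathcal{H}(\varepsilon; \ell, P) = \{h \in \mathcal{H} : \mathrm{R}_\ell(h; P) - \inf_{g \in \mathcal{H}} \mathrm{R}_\ell(g; P) \le \varepsilon\}$.
When $P = \mathcal{P}_{XY}$, we abbreviate these as $\mathrm{D}_\ell(h, g) = \mathrm{D}_\ell(h, g; \mathcal{P}_{XY})$,
$\mathrm{D}_\ell(\mathcal{H}) = \mathrm{D}_\ell(\mathcal{H}; \mathcal{P}_{XY})$, and $\mathcal{H}(\varepsilon; \ell) = \mathcal{H}(\varepsilon; \ell, \mathcal{P}_{XY})$. Also, for any $h \in \mathcal{F}^*$, abbreviate
$h_{\mathcal{U}} = h_{\mathcal{U}, f^*}$, and for any $\mathcal{H} \subseteq \mathcal{F}^*$, define $\mathcal{H}_{\mathcal{U}} = \{h_{\mathcal{U}} : h \in \mathcal{H}\}$.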

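We additionally define related quantities for the 0-1 loss, as follows. Define the
distance $\Delta_P(h, g) = \mathcal{P}(x : \mathrm{sign}(h(x)) \neq \mathrm{sign}(g(x)))$ and radius
$\mathrm{radius}(\mathcal{H}; P) = \sup_{h \in \mathcal{H}} \Delta_P(h, f^*_P)$. Also define the $\varepsilon$-minimal set of $\mathcal{H}$ as
$\mathcal{H}(\varepsilon; {}_{01}, P) = \{h \in \mathcal{H} : \mathrm{er}(h; P) - \inf_{g \in \mathcal{H}} \mathrm{er}(g; P) \le \varepsilon\}$, and for $r > 0$, define
the $r$-ball centered at $h$ in $\mathcal{H}$ by $\mathrm{B}_{\mathcal{H}, P}(h, r) = \{g \in \mathcal{H} : \Delta_P(h, g) \le r\}$. When
$P = \mathcal{P}_{XY}$, we abbreviate these as $\Delta(h, g) = \Delta_{\mathcal{P}_{XY}}(h, g)$,
$\mathrm{radius}(\mathcal{H}) = \mathrm{radius}(\mathcal{H}; \mathcal{P}_{XY})$, $\mathcal{H}(\varepsilon; {}_{01}) = \mathcal{H}(\varepsilon; {}_{01}, \mathcal{P}_{XY})$, and
$\mathrm{B}_{\mathcal{H}}(h, r) = \mathrm{B}_{\mathcal{H}, \mathcal{P}_{XY}}(h, r)$; when $\mathcal{H} = \mathcal{F}$, further abbreviate $\mathrm{B}(h, r) = \mathrm{B}_{\mathcal{F}}(h, r)$.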

We will be interested in transforming results concerning the excess surrogate 
risk into results on the excess error rate. As such, we will make use of the following 
abstract transformation. 






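DEFINITION 2. For any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, and any $\varepsilon \in [0, 1]$, define
\[ \Gamma_\ell(\varepsilon; P) = \sup\left( \{\gamma > 0 : \mathcal{F}^*(\gamma; \ell, P) \subseteq \mathcal{F}^*(\varepsilon; {}_{01}, P)\} \cup \{0\} \right). \]
Also, for any $\gamma \in [0, \infty)$, define the inverse
\[ \mathcal{E}_\ell(\gamma; P) = \inf\{\varepsilon > 0 : \gamma \le \Gamma_\ell(\varepsilon; P)\}. \]
When $P = \mathcal{P}_{XY}$, abbreviate $\Gamma_\ell(\varepsilon) = \Gamma_\ell(\varepsilon; \mathcal{P}_{XY})$ and $\mathcal{E}_\ell(\gamma) = \mathcal{E}_\ell(\gamma; \mathcal{P}_{XY})$. $\diamond$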

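By definition, for classification-calibrated $\ell$, $\Gamma_\ell$ has the property that
\[ \forall h \in \mathcal{F}^*, \forall \varepsilon \in [0, 1], \quad \mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^*) < \Gamma_\ell(\varepsilon) \implies \mathrm{er}(h) - \mathrm{er}(f^*) \le \varepsilon. \tag{1} \]
In fact, $\Gamma_\ell$ is defined to be maximal with this property, in that any $\Gamma'_\ell$ for which (1)
is satisfied must have $\Gamma'_\ell(\varepsilon) \le \Gamma_\ell(\varepsilon)$ for all $\varepsilon \in [0, 1]$.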

In our context, we will typically be interested in calculating lower bounds on $\Gamma_\ell$
for any particular scenario of interest. Bartlett, Jordan, and McAuliffe [6] studied
various lower bounds of this type. Specifically, for $\zeta \in [-1, 1]$, define
$\tilde{\psi}_\ell(\zeta) = \ell^*_-\left(\frac{1+\zeta}{2}\right) - \ell^*\left(\frac{1+\zeta}{2}\right)$, and let $\psi_\ell$ be the largest convex lower bound of
$\tilde{\psi}_\ell$ on $[0, 1]$, which is well-defined in this context [6]. Bartlett, Jordan, and
McAuliffe [6] show $\psi_\ell$ is continuous and nondecreasing on $(0, 1)$, and in fact that
$x \mapsto \psi_\ell(x)/x$ is nondecreasing on $(0, \infty)$. They also show every $h \in \mathcal{F}^*$ has
$\psi_\ell(\mathrm{er}(h) - \mathrm{er}(f^*)) \le \mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^*)$, so that $\psi_\ell \le \Gamma_\ell$, and they find this
inequality can be tight for a particular choice of $\mathcal{P}_{XY}$. They further study more subtle
relationships between excess $\ell$-risk and excess error rate holding for any
classification-calibrated $\ell$. In particular, following the same argument as in the proof of their
Theorem 3, one can show that if $\ell$ is classification-calibrated, every $h \in \mathcal{F}^*$ satisfies
\[ \Delta(h, f^*) \cdot \psi_\ell\left( \frac{\mathrm{er}(h) - \mathrm{er}(f^*)}{2\Delta(h, f^*)} \right) \le \mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^*). \]

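The implication of this in our context is the following. Fix any nondecreasing
function $\Psi_\ell : [0, 1] \to [0, \infty)$ such that $\forall \varepsilon > 0$,
\[ \Psi_\ell(\varepsilon) \le \mathrm{radius}(\mathcal{F}^*(\varepsilon; {}_{01})) \, \psi_\ell\left( \frac{\varepsilon}{2\,\mathrm{radius}(\mathcal{F}^*(\varepsilon; {}_{01}))} \right). \tag{2} \]
Any $h \in \mathcal{F}^*$ with $\mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^*) < \Psi_\ell(\varepsilon)$ also has
$\Delta(h, f^*) \, \psi_\ell\left( \frac{\mathrm{er}(h) - \mathrm{er}(f^*)}{2\Delta(h, f^*)} \right) < \Psi_\ell(\varepsilon)$;
combined with the fact that $x \mapsto \psi_\ell(x)/x$ is nondecreasing on $(0, \infty)$, this implies
\[ \mathrm{radius}(\mathcal{F}^*(\mathrm{er}(h) - \mathrm{er}(f^*); {}_{01})) \, \psi_\ell\left( \frac{\mathrm{er}(h) - \mathrm{er}(f^*)}{2\,\mathrm{radius}(\mathcal{F}^*(\mathrm{er}(h) - \mathrm{er}(f^*); {}_{01}))} \right) < \Psi_\ell(\varepsilon); \]
this means $\Psi_\ell(\mathrm{er}(h) - \mathrm{er}(f^*)) < \Psi_\ell(\varepsilon)$, and monotonicity of $\Psi_\ell$ implies
$\mathrm{er}(h) - \mathrm{er}(f^*) < \varepsilon$. Altogether, this implies $\Psi_\ell(\varepsilon) \le \Gamma_\ell(\varepsilon)$. In fact, though we do not
present the details here, with only minor modifications to the proofs below, when
$f^* \in \mathcal{F}$, all of our results involving $\Gamma_\ell(\varepsilon)$ will also hold while replacing $\Gamma_\ell(\varepsilon)$ with
any nondecreasing $\Psi'_\ell$ such that $\forall \varepsilon > 0$,
\[ \Psi'_\ell(\varepsilon) \le \mathrm{radius}(\mathcal{F}(\varepsilon; {}_{01})) \, \psi_\ell\left( \frac{\varepsilon}{2\,\mathrm{radius}(\mathcal{F}(\varepsilon; {}_{01}))} \right), \tag{3} \]
which can sometimes lead to tighter results.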
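
As a worked instance of condition (2), consider the following sketch. It assumes, purely for illustration, that $\psi_\ell(x) = x^2$ (as for the quadratic loss) and that the radius obeys a bound of the form $c\,\varepsilon^{\alpha}$; the constants $c > 0$ and $\alpha \in [0, 1]$ are hypothetical and are not quantities defined above.

```latex
% Assumptions (for illustration only):
%   \psi_\ell(x) = x^2, and
%   r(\varepsilon) := radius(F^*(\varepsilon; 01)) satisfies 0 < r(\varepsilon) \le c \varepsilon^{\alpha}.
\begin{align*}
r(\varepsilon)\,\psi_\ell\!\left(\frac{\varepsilon}{2 r(\varepsilon)}\right)
  &= r(\varepsilon)\cdot\frac{\varepsilon^{2}}{4 r(\varepsilon)^{2}}
   = \frac{\varepsilon^{2}}{4 r(\varepsilon)}
  && \text{(substituting } \psi_\ell(x) = x^2\text{)} \\
  &\ge \frac{\varepsilon^{2}}{4 c\,\varepsilon^{\alpha}}
   = \frac{\varepsilon^{2-\alpha}}{4c}
  && \text{(since } r(\varepsilon) \le c\,\varepsilon^{\alpha}\text{)}.
\end{align*}
% Hence \Psi_\ell(\varepsilon) = \varepsilon^{2-\alpha} / (4c) is nondecreasing and satisfies (2),
% so \Psi_\ell(\varepsilon) \le \Gamma_\ell(\varepsilon).
% It exceeds the generic bound \psi_\ell(\varepsilon) = \varepsilon^{2} whenever 4 c \varepsilon^{\alpha} < 1.
```

Under these assumptions, a smaller radius translates directly into a larger admissible value of the surrogate excess risk for a given target excess error rate.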

Some of our stronger results below will be stated for a restricted family of losses, 
originally explored by Bartlett, Jordan, and McAuliffe [6]: namely, smooth losses 
whose convexity is quantified by a polynomial. Specifically, this restriction is
characterized by the following condition.

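CONDITION 3. $\mathcal{F}$ is convex, with $\forall x \in \mathcal{X}$, $\sup_{f \in \mathcal{F}} |f(x)| \le B$ for some
constant $B \in (0, \infty)$, and there exists a pseudometric $d_\ell : [-B, B]^2 \to [0, \bar{d}_\ell]$
for some constant $\bar{d}_\ell \in (0, \infty)$, and constants $L, C_\ell \in (0, \infty)$ and $r_\ell \in (0, \infty]$
such that $\forall x, y \in [-B, B]$, $|\ell(x) - \ell(y)| \le L d_\ell(x, y)$ and the function
$\bar{\delta}_\ell(\varepsilon) = \inf\left\{ \frac{1}{2}\ell(x) + \frac{1}{2}\ell(y) - \ell\left(\frac{1}{2}x + \frac{1}{2}y\right) : x, y \in [-B, B], d_\ell(x, y) \ge \varepsilon \right\} \cup \{\infty\}$
satisfies $\forall \varepsilon \in (0, 1)$, $\bar{\delta}_\ell(\varepsilon) \ge C_\ell \varepsilon^{r_\ell}$. $\diamond$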

In particular, note that if $\mathcal{F}$ is convex, the functions in $\mathcal{F}$ are uniformly bounded,
and $\ell$ is continuous, Condition 3 is always satisfied (though possibly with $r_\ell = \infty$).

2.2. A Few Examples of Loss Functions. Here we briefly mention a few loss
functions $\ell$ in common practical use, all of which are classification-calibrated.
These examples are taken directly from the work of Bartlett, Jordan, and McAuliffe
[6], which additionally discusses many other interesting examples of
classification-calibrated loss functions and their corresponding $\psi_\ell$ functions.

Example 1. The exponential loss is specified as $\ell(x) = e^{-x}$. This loss
function appears in many contexts in machine learning; for instance, the popular
AdaBoost method can be viewed as an algorithm that greedily optimizes the
exponential loss [13]. Bartlett, Jordan, and McAuliffe [6] show that under the
exponential loss, $\psi_\ell(x) = 1 - \sqrt{1 - x^2}$, which is tightly approximated by $x^2/2$ for
small $x$. They also show this loss satisfies the conditions on $\ell$ in Condition 3 with
$d_\ell(x, y) = |x - y|$, $L = e^{B}$, $C_\ell = e^{-B}/8$, and $r_\ell = 2$.

Example 2. The hinge loss, specified as $\ell(x) = \max\{1 - x, 0\}$, is another
common surrogate loss in machine learning practice today. For instance, it is used in
the objective of the Support Vector Machine (along with a regularization term)
[10]. Bartlett, Jordan, and McAuliffe [6] show that for the hinge loss, $\psi_\ell(x) = |x|$.
The hinge loss is Lipschitz continuous, with Lipschitz constant 1. However, for the
remaining conditions on $\ell$ in Condition 3, any $x, y \le 1$ have
$\frac{1}{2}\ell(x) + \frac{1}{2}\ell(y) = \ell\left(\frac{1}{2}x + \frac{1}{2}y\right)$, so that $\bar{\delta}_\ell(\varepsilon) = 0$; hence, $r_\ell = \infty$ is required.

Example 3. The quadratic loss (or squared loss), specified as $\ell(x) = (1 - x)^2$,
is often used in so-called plug-in classifiers [2], which approach the problem of
learning a classifier by estimating the regression function
$\mathbb{E}[Y | X = x] = 2\eta(x) - 1$, and then taking the sign of this estimator to get a binary
classifier. The quadratic loss has the convenient property that for any distribution
$P$ over $\mathcal{X} \times \mathcal{Y}$, $f^*_P(\cdot) = 2\eta(\cdot; P) - 1$, so that it is straightforward to describe the
set of distributions $P$ satisfying the assumption $f^*_P \in \mathcal{F}$. Bartlett, Jordan, and
McAuliffe [6] show that for the quadratic loss, $\psi_\ell(x) = x^2$. They also show the
quadratic loss satisfies the conditions on $\ell$ in Condition 3, with $L = 2(B + 1)$,
$C_\ell = 1/4$, and $r_\ell = 2$. In fact, they study the general family of losses
$\ell(x) = |1 - x|^p$, for $p \in (1, \infty)$, and show that $\psi_\ell(x)$ and $r_\ell$ exhibit a range of
behaviors varying with $p$.

Example 4. The truncated quadratic loss is specified as $\ell(x) = (\max\{1 - x, 0\})^2$.
Bartlett, Jordan, and McAuliffe [6] show that in this case, $\psi_\ell(x) = x^2$. They also
show that, under the pseudometric $d_\ell(a, b) = |\min\{a, 1\} - \min\{b, 1\}|$, the
truncated quadratic loss satisfies the conditions on $\ell$ in Condition 3, with $L = 2(B + 1)$,
$C_\ell = 1/4$, and $r_\ell = 2$.
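
The closed forms quoted in Examples 1 through 4 can be checked numerically. The sketch below computes the difference between the sign-constrained and unconstrained minimum conditional risks at $\eta_0 = (1 + \zeta)/2$ on a finite grid, and compares it with the stated closed form. In these four cases that difference is already convex in $\zeta$, so it coincides with its largest convex lower bound. The grid range, the test points for $\zeta$, and all function names are assumptions of this illustration.

```python
import math

# Candidate z values in [-5, 5]; a finite stand-in for the infimum over all z.
GRID = [k / 100.0 for k in range(-500, 501)]


def cond_risk(loss, eta, z):
    """Conditional surrogate risk eta * loss(z) + (1 - eta) * loss(-z)."""
    return eta * loss(z) + (1.0 - eta) * loss(-z)


def psi_tilde(loss, zeta):
    """Gap between the wrong-sign minimum and the overall minimum at eta = (1 + zeta) / 2."""
    eta = (1.0 + zeta) / 2.0
    best = min(cond_risk(loss, eta, z) for z in GRID)
    best_wrong_sign = min(
        cond_risk(loss, eta, z) for z in GRID if z * (2.0 * eta - 1.0) <= 0
    )
    return best_wrong_sign - best


# Each entry pairs a loss with the closed form quoted in the examples above.
LOSSES = {
    "exponential": (lambda z: math.exp(-z), lambda x: 1.0 - math.sqrt(1.0 - x * x)),
    "hinge": (lambda z: max(1.0 - z, 0.0), lambda x: abs(x)),
    "quadratic": (lambda z: (1.0 - z) ** 2, lambda x: x * x),
    "truncated quadratic": (lambda z: max(1.0 - z, 0.0) ** 2, lambda x: x * x),
}

# Largest discrepancy between the grid computation and the closed forms.
max_gap = max(
    abs(psi_tilde(loss, zeta) - closed_form(zeta))
    for loss, closed_form in LOSSES.values()
    for zeta in (0.2, 0.5, 0.8)
)
```

The discrepancy `max_gap` is limited only by the grid resolution, which matters for the exponential loss, whose minimizer does not fall exactly on a grid point.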

2.3. Empirical $\ell$-Risk Minimization. For any $m \in \mathbb{N}$, $g : \mathcal{X} \to \bar{\mathbb{R}}$, and
$S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$, define the empirical $\ell$-risk as
$\mathrm{R}_\ell(g; S) = m^{-1} \sum_{i=1}^{m} \ell(g(x_i) y_i)$. At times it will be convenient to keep track of the
indices for a subsequence of $\mathcal{Z}$, and for this reason we also overload the notation, so
that for any $Q = \{(i_1, y_1), \ldots, (i_m, y_m)\} \in (\mathbb{N} \times \mathcal{Y})^m$, we define
$S[Q] = \{(X_{i_1}, y_1), \ldots, (X_{i_m}, y_m)\}$ and $\mathrm{R}_\ell(g; Q) = \mathrm{R}_\ell(g; S[Q])$. For completeness, we
also generally define $\mathrm{R}_\ell(g; \emptyset) = 0$. The method of empirical $\ell$-risk minimization,
here denoted by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$, is characterized by the property that it returns
$\hat{h} = \mathrm{argmin}_{h \in \mathcal{H}} \mathrm{R}_\ell(h; \mathcal{Z}_m)$. This is a well-studied and classical passive learning
method, presently in popular use in applications, and as such it will serve as our
baseline for passive learning methods.
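
A minimal Python sketch of empirical risk minimization over a finite class follows. The finite threshold class, the hinge loss, and the names `empirical_risk`, `subsample`, and `erm` are assumptions of this illustration; the definition above does not restrict the class to be finite or prescribe how the minimizer is computed.

```python
def empirical_risk(loss, g, sample):
    """Average surrogate loss of g on the sample; defined as 0 on the empty sample."""
    if not sample:
        return 0.0
    return sum(loss(g(x) * y) for x, y in sample) / len(sample)


def subsample(indexed_labels, xs):
    """Build the labeled sample S[Q] from index/label pairs Q = [(i, y), ...]."""
    return [(xs[i], y) for i, y in indexed_labels]


def erm(loss, function_class, sample):
    """Return a minimizer of the empirical surrogate risk over a finite class."""
    return min(function_class, key=lambda g: empirical_risk(loss, g, sample))


def make_threshold(t):
    # Hypothetical class member: +1 at or above the threshold t, -1 below it.
    h = lambda x: 1.0 if x >= t else -1.0
    h.t = t  # keep the threshold for inspection
    return h


hinge = lambda z: max(1.0 - z, 0.0)
xs = [0.05 + 0.1 * k for k in range(10)]
Q = [(i, 1 if xs[i] >= 0.5 else -1) for i in range(10)]
function_class = [make_threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
h_hat = erm(hinge, function_class, subsample(Q, xs))
```

Here the labels are generated by a threshold at 0.5, so the empirical minimizer is the class member with `t == 0.5`, which attains zero empirical hinge risk.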

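2.4. Localized Sample Complexities. The derivation of localized excess risk
bounds can essentially be motivated as follows. Suppose we are interested in
bounding the excess $\ell$-risk of $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$. Further suppose we have a coarse
guarantee $U_\ell(\mathcal{H}, m)$ on the excess $\ell$-risk of the $\hat{h}$ returned by $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$: that is,
$\mathrm{R}_\ell(\hat{h}) - \mathrm{R}_\ell(f^*) \le U_\ell(\mathcal{H}, m)$. In some sense, this guarantee identifies a set
$\mathcal{H}' \subseteq \mathcal{H}$ of functions that a priori have the potential to be returned by
$\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ (namely, $\mathcal{H}' = \mathcal{H}(U_\ell(\mathcal{H}, m); \ell)$), while those in $\mathcal{H} \setminus \mathcal{H}'$ do not. With
this information in hand, we can think of $\mathcal{H}'$ as a kind of effective function class,
and we can then think of $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ as equivalent to $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m)$. We may then
repeat this same reasoning for $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m)$, calculating $U_\ell(\mathcal{H}', m)$ to determine a set
$\mathcal{H}'' = \mathcal{H}'(U_\ell(\mathcal{H}', m); \ell) \subseteq \mathcal{H}'$ of potential return values for this empirical
minimizer, so that $\mathrm{ERM}_\ell(\mathcal{H}', \mathcal{Z}_m) = \mathrm{ERM}_\ell(\mathcal{H}'', \mathcal{Z}_m)$, and so on. This repeats until
we identify a fixed-point set $\mathcal{H}^{(\infty)}$ of functions such that
$\mathcal{H}^{(\infty)}(U_\ell(\mathcal{H}^{(\infty)}, m); \ell) = \mathcal{H}^{(\infty)}$, so that no further reduction is possible. Following this
chain of reasoning back to the beginning, we find that
$\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m) = \mathrm{ERM}_\ell(\mathcal{H}^{(\infty)}, \mathcal{Z}_m)$, so that the function $\hat{h}$ returned by
$\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ has excess $\ell$-risk at most $U_\ell(\mathcal{H}^{(\infty)}, m)$, which may be significantly
smaller than $U_\ell(\mathcal{H}, m)$, depending on how refined the original $U_\ell(\mathcal{H}, m)$ bound was.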
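
The fixed-point argument above can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: each function is represented by its vector of losses on a four-point population under the uniform distribution, and `coarse_bound` is a hypothetical stand-in for the coarse guarantee (a loss-diameter term plus a lower-order term), not the bound formalized below.

```python
import math


def risk(v):
    """Risk of a function represented by its vector of per-point losses."""
    return sum(v) / len(v)


def loss_distance(u, v):
    """Root mean squared difference between two loss vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def diameter(H):
    """Loss diameter of a finite class of loss vectors."""
    return max(loss_distance(u, v) for u in H for v in H)


def coarse_bound(H, m, s):
    # Hypothetical stand-in for the coarse guarantee: it shrinks with the diameter of H.
    return diameter(H) * math.sqrt(s / m) + s / m


def minimal_set(H, eps):
    """Members of H whose risk is within eps of the best risk in H."""
    best = min(risk(v) for v in H)
    return [v for v in H if risk(v) - best <= eps]


def localize(H, m, s):
    """Shrink H to its minimal set under the current bound until nothing changes."""
    while True:
        smaller = minimal_set(H, coarse_bound(H, m, s))
        if len(smaller) == len(H):
            return H
        H = smaller


# 21 functions; function k incurs loss 0.05 * k at each of the 4 population points.
H0 = [[0.05 * k] * 4 for k in range(21)]
initial_bound = coarse_bound(H0, m=100, s=1.0)
H_inf = localize(H0, m=100, s=1.0)
final_bound = coarse_bound(H_inf, m=100, s=1.0)
```

In this toy instance the class shrinks from 21 functions to 3 and then to 1, and the bound drops from 0.11 on the full class to 0.01 at the fixed point.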

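To formalize this fixed-point argument for $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$, Koltchinskii [23]
makes use of the following quantities to define the coarse bound $U_\ell(\mathcal{H}, m)$ [see
also 7, 15]. For any $\mathcal{H} \subseteq [\mathcal{F}]$, $m \in \mathbb{N}$, $s \in [1, \infty)$, and any distribution $P$ on
$\mathcal{X} \times \mathcal{Y}$, letting $Q \sim P^m$, define
\[ \phi_\ell(\mathcal{H}; m, P) = \mathbb{E}\left[ \sup_{h, g \in \mathcal{H}} \left( \mathrm{R}_\ell(h; P) - \mathrm{R}_\ell(g; P) \right) - \left( \mathrm{R}_\ell(h; Q) - \mathrm{R}_\ell(g; Q) \right) \right], \]
\[ \bar{U}_\ell(\mathcal{H}; P, m, s) = K_1 \phi_\ell(\mathcal{H}; m, P) + K_2 \mathrm{D}_\ell(\mathcal{H}; P) \sqrt{\frac{s}{m}} + \frac{K_3 \bar{\ell} s}{m}, \]
\[ \tilde{U}_\ell(\mathcal{H}; P, m, s) = K \left( \phi_\ell(\mathcal{H}; m, P) + \mathrm{D}_\ell(\mathcal{H}; P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m} \right), \]
where $K_1$, $K_2$, $K_3$, and $K$ are appropriately chosen constants.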

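We will be interested in having access to these quantities in the context of our
algorithms; however, since $\mathcal{P}_{XY}$ is not directly accessible to the algorithm, we
will need to approximate these by data-dependent estimators. Toward this end,
we define the following quantities, again taken from the work of Koltchinskii
[23]. For $\varepsilon > 0$, let $\mathbb{Z}_\varepsilon = \{j \in \mathbb{Z} : 2^j \ge \varepsilon\}$. For any $\mathcal{H} \subseteq [\mathcal{F}]$, $q \in \mathbb{N}$,
and $S = \{(x_1, y_1), \ldots, (x_q, y_q)\} \in (\mathcal{X} \times \{-1, +1\})^q$, let
$\mathcal{H}(\varepsilon; \ell, S) = \{h \in \mathcal{H} : \mathrm{R}_\ell(h; S) - \inf_{g \in \mathcal{H}} \mathrm{R}_\ell(g; S) \le \varepsilon\}$; then for any sequence
$\Xi = \{\xi_k\}_{k=1}^{q} \in \{-1, +1\}^q$, and any $s \in [1, \infty)$, define
\[ \hat{\phi}_\ell(\mathcal{H}; S, \Xi) = \sup_{h, g \in \mathcal{H}} \frac{1}{q} \sum_{k=1}^{q} \xi_k \cdot \left( \ell(h(x_k) y_k) - \ell(g(x_k) y_k) \right), \]
\[ \hat{\mathrm{D}}_\ell(\mathcal{H}; S)^2 = \sup_{h, g \in \mathcal{H}} \frac{1}{q} \sum_{k=1}^{q} \left( \ell(h(x_k) y_k) - \ell(g(x_k) y_k) \right)^2, \]
\[ \hat{U}_\ell(\mathcal{H}; S, \Xi, s) = 12 \hat{\phi}_\ell(\mathcal{H}; S, \Xi) + 34 \hat{\mathrm{D}}_\ell(\mathcal{H}; S) \sqrt{\frac{s}{q}} + \frac{752 \bar{\ell} s}{q}. \]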






For completeness, define $\hat{\phi}_\ell(\mathcal{H}; \emptyset, \emptyset) = \hat{\mathrm{D}}_\ell(\mathcal{H}; \emptyset) = 0$, and
$\hat{U}_\ell(\mathcal{H}; \emptyset, \emptyset, s) = 752 \bar{\ell} s$.

The above quantities (with appropriate choices of $K_1$, $K_2$, $K_3$, and $K$) can be
formally related to each other and to the excess $\ell$-risk of functions in $\mathcal{H}$ via the
following general result; this variant is due to Koltchinskii [23].
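
For a finite class, the data-dependent quantities above can be computed directly. The sketch below does so in plain Python; the two-function class, the two-point sample, the fixed sign sequence, and the function names are assumptions chosen so the values can be checked by hand, while the constants 12, 34, and 752 are the ones appearing in the definition.

```python
import math


def loss_vectors(loss, H, sample):
    """Per-example surrogate losses of every function in a finite class H."""
    return [[loss(h(x) * y) for x, y in sample] for h in H]


def phi_hat(vectors, signs):
    """Sign-weighted supremum of average loss differences over pairs (h, g)."""
    q = len(signs)
    if q == 0:
        return 0.0
    return max(
        sum(xi * (a - b) for xi, a, b in zip(signs, u, v)) / q
        for u in vectors
        for v in vectors
    )


def d_hat(vectors, q):
    """Empirical loss diameter of the class on a sample of size q."""
    if q == 0:
        return 0.0
    return math.sqrt(
        max(sum((a - b) ** 2 for a, b in zip(u, v)) / q for u in vectors for v in vectors)
    )


def u_hat(vectors, signs, s, ell_bar):
    """Empirical bound combining phi_hat, d_hat, and the lower-order term."""
    q = len(signs)
    if q == 0:
        return 752.0 * ell_bar * s
    return (
        12.0 * phi_hat(vectors, signs)
        + 34.0 * d_hat(vectors, q) * math.sqrt(s / q)
        + 752.0 * ell_bar * s / q
    )


hinge = lambda z: max(1.0 - z, 0.0)
H = [lambda x: 1.0, lambda x: -1.0]  # two constant functions (illustrative)
S = [(0.2, -1), (0.8, 1)]
Xi = [1, -1]                         # fixed signs, so the values are reproducible
V = loss_vectors(hinge, H, S)        # [[2.0, 0.0], [0.0, 2.0]]
```

With this toy input, both `phi_hat(V, Xi)` and `d_hat(V, 2)` evaluate to 2, and the large constant on the last term dominates `u_hat` at such a small sample size.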

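LEMMA 4. For any $\mathcal{H} \subseteq [\mathcal{F}]$, $s \in [1, \infty)$, distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, and
any $m \in \mathbb{N}$, if $Q \sim P^m$ and $\Xi = \{\xi_1, \ldots, \xi_m\} \sim \mathrm{Uniform}(\{-1, +1\})^m$ are
independent, and $h^*$ has $\mathrm{R}_\ell(h^*; P) = \inf_{h \in \mathcal{H}} \mathrm{R}_\ell(h; P)$, then with probability
at least $1 - 6e^{-s}$, the following claims hold.
\[ \forall h \in \mathcal{H}, \quad \mathrm{R}_\ell(h; P) - \mathrm{R}_\ell(h^*; P) \le \mathrm{R}_\ell(h; Q) - \mathrm{R}_\ell(h^*; Q) + \bar{U}_\ell(\mathcal{H}; P, m, s), \]
\[ \forall h \in \mathcal{H}, \quad \mathrm{R}_\ell(h; Q) - \inf_{g \in \mathcal{H}} \mathrm{R}_\ell(g; Q) \le \mathrm{R}_\ell(h; P) - \mathrm{R}_\ell(h^*; P) + \bar{U}_\ell(\mathcal{H}; P, m, s), \]
\[ \bar{U}_\ell(\mathcal{H}; P, m, s) \le \hat{U}_\ell(\mathcal{H}; Q, \Xi, s) \le \tilde{U}_\ell(\mathcal{H}; P, m, s). \]
$\diamond$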

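We typically expect the $\bar{U}$, $\hat{U}$, and $\tilde{U}$ quantities to be roughly within constant
factors of each other. Following Koltchinskii [23] and Giné and Koltchinskii [15], we
can use this result to derive localized bounds on the number of samples sufficient
for $\mathrm{ERM}_\ell(\mathcal{H}, \mathcal{Z}_m)$ to achieve a given excess $\ell$-risk. Specifically, for $\mathcal{H} \subseteq [\mathcal{F}]$,
distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, values $\gamma, \gamma_1, \gamma_2 \ge 0$, $s \in [1, \infty)$, and any function
$\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$, define the following quantities.
\[ \bar{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \min\left\{ m \in \mathbb{N} : \bar{U}_\ell(\mathcal{H}(\gamma_2; \ell, P); P, m, s) < \gamma_1 \right\}, \]
\[ \bar{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) = \sup_{\gamma' \ge \gamma} \bar{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma')), \]
\[ \tilde{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \min\left\{ m \in \mathbb{N} : \tilde{U}_\ell(\mathcal{H}(\gamma_2; \ell, P); P, m, s) \le \gamma_1 \right\}, \]
\[ \tilde{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) = \sup_{\gamma' \ge \gamma} \tilde{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma')). \]
These quantities are well-defined for $\gamma_1, \gamma_2, \gamma > 0$ when $\lim_{m \to \infty} \phi_\ell(\mathcal{H}; m, P) = 0$.
In other cases, for completeness, we define them to be $\infty$.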

In particular, the quantity $\bar{M}_\ell(\gamma; \mathcal{F}, \mathcal{P}_{XY}, \mathfrak{s})$ is used in Theorem 6 below to
quantify the performance of $\mathrm{ERM}_\ell(\mathcal{F}, \mathcal{Z}_m)$. The primary practical challenge in
calculating $\bar{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s})$ is handling the $\phi_\ell(\mathcal{H}(\gamma'; \ell, P); m, P)$ quantity. In the
literature, the typical (only?) way such calculations are approached is by first
deriving a bound on $\phi_\ell(\mathcal{H}'; m, P)$ for every $\mathcal{H}' \subseteq \mathcal{H}$ in terms of some natural
measure of complexity for the full class $\mathcal{H}$ (e.g., entropy numbers) and some very basic
measure of complexity for $\mathcal{H}'$: most often $\mathrm{D}_\ell(\mathcal{H}'; P)$ and sometimes a seminorm
of an envelope function for $\mathcal{H}'$. After this, one then proceeds to bound these basic
measures of complexity for the specific subsets $\mathcal{H}(\gamma'; \ell, P)$, as a function of $\gamma'$.
Composing these two results is then sufficient to bound $\phi_\ell(\mathcal{H}(\gamma'; \ell, P); m, P)$. For
instance, bounds based on an entropy integral tend to follow this strategy. This
approach effectively decomposes the problem of calculating the complexity of
$\mathcal{H}(\gamma'; \ell, P)$ into the problem of calculating the complexity of $\mathcal{H}$ and the problem
of calculating some much more basic properties of $\mathcal{H}(\gamma'; \ell, P)$. See [6, 15, 23, 35],
or Section 5 below, for several explicit examples of this technique.

Another technique often (though not always) used in conjunction with the above
strategy when deriving explicit rates of convergence is to relax $\mathrm{D}_\ell(\mathcal{H}(\gamma'; \ell, P); P)$
to $\mathrm{D}_\ell(\mathcal{F}^*(\gamma'; \ell, P); P)$ or $\mathrm{D}_\ell([\mathcal{H}](\gamma'; \ell, P); P)$. This relaxation can sometimes be
a source of slack; however, in many interesting cases, such as for certain losses $\ell$
[e.g., 6], or even certain noise conditions [e.g., 27, 33], this relaxed quantity can
still lead to nearly tight bounds.

For our purposes, it will be convenient to make these common techniques
explicit in the results. In later sections, this will make the benefits of our proposed
methods more explicit, while still allowing us to state results in a form abstract
enough to capture the variety of specific complexity measures most often used in
conjunction with the above approach. Toward this end, we have the following
definition.

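DEFINITION 5. For every distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, let $\mathring{\phi}_\ell(\sigma, \mathcal{H}; m, P)$ be
a quantity defined for every $\sigma \in [0,\infty]$, $\mathcal{H} \subseteq [\mathcal{F}]$, and $m \in \mathbb{N}$, such that the
following conditions are satisfied when $f^\star_P \in \mathcal{H}$.

(4) If $0 \le \sigma \le \sigma'$, $\mathcal{H} \subseteq \mathcal{H}' \subseteq [\mathcal{F}]$, $\mathcal{U} \subseteq \mathcal{X}$, and $m' \le m$, then $\mathring{\phi}_\ell(\sigma, \mathcal{H}_{\mathcal{U}, f^\star_P}; m, P) \le \mathring{\phi}_\ell(\sigma', \mathcal{H}'; m', P)$.

(5) $\forall \sigma \ge \mathrm{D}_\ell(\mathcal{H}; P)$, $\phi_\ell(\mathcal{H}; m, P) \le \mathring{\phi}_\ell(\sigma, \mathcal{H}; m, P)$.

$\diamond$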

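For instance, most bounds based on entropy integrals can be made to satisfy this.
See Section 5.3 for explicit examples of quantities $\mathring{\phi}_\ell$ from the literature that satisfy
this definition. Given a function $\mathring{\phi}_\ell$ of this type, we define the following quantity
for $m \in \mathbb{N}$, $s \in [1, \infty)$, $\zeta \in [0, \infty]$, $\mathcal{H} \subseteq [\mathcal{F}]$, and a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$.

$\mathring{U}_\ell(\mathcal{H}, \zeta; P, m, s) = K \left( \mathring{\phi}_\ell\left(\mathrm{D}_\ell([\mathcal{H}](\zeta; \ell, P); P), \mathcal{H}; m, P\right) + \mathrm{D}_\ell([\mathcal{H}](\zeta; \ell, P); P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m} \right)$.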

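Note that when $f^\star_P \in \mathcal{H}$, since $\mathrm{D}_\ell([\mathcal{H}](\gamma; \ell, P); P) \ge \mathrm{D}_\ell(\mathcal{H}(\gamma; \ell, P); P)$, Definition 5
implies $\phi_\ell(\mathcal{H}(\gamma; \ell, P); m, P) \le \mathring{\phi}_\ell(\mathrm{D}_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}(\gamma; \ell, P); m, P)$,
and furthermore $\mathcal{H}(\gamma; \ell, P) \subseteq \mathcal{H}$ so that $\mathring{\phi}_\ell(\mathrm{D}_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}(\gamma; \ell, P); m, P)
\le \mathring{\phi}_\ell(\mathrm{D}_\ell([\mathcal{H}](\gamma; \ell, P); P), \mathcal{H}; m, P)$. Thus,

(6) $\mathrm{U}_\ell(\mathcal{H}(\gamma; \ell, P); P, m, s) \le \mathring{U}_\ell(\mathcal{H}(\gamma; \ell, P), \gamma; P, m, s) \le \mathring{U}_\ell(\mathcal{H}, \gamma; P, m, s)$.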

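Furthermore, when $f^\star_P \in \mathcal{H}$, for any measurable $\mathcal{U} \subseteq \mathcal{U}' \subseteq \mathcal{X}$, any $\gamma' \ge \gamma \ge 0$,
and any $\mathcal{H}' \subseteq [\mathcal{F}]$ with $\mathcal{H} \subseteq \mathcal{H}'$,

(7) $\mathring{U}_\ell(\mathcal{H}_{\mathcal{U}, f^\star_P}, \gamma; P, m, s) \le \mathring{U}_\ell(\mathcal{H}'_{\mathcal{U}', f^\star_P}, \gamma'; P, m, s)$.

Note that the fact that we use $\mathrm{D}_\ell([\mathcal{H}](\gamma; \ell, P); P)$ instead of $\mathrm{D}_\ell(\mathcal{H}(\gamma; \ell, P); P)$ in
the definition of $\mathring{U}_\ell$ is crucial for these inequalities to hold; specifically, it is not
necessarily true that $\mathrm{D}_\ell(\mathcal{H}_{\mathcal{U}, f^\star_P}(\gamma; \ell, P); P) \le \mathrm{D}_\ell(\mathcal{H}_{\mathcal{U}', f^\star_P}(\gamma; \ell, P); P)$, but it is
always the case that $[\mathcal{H}_{\mathcal{U}, f^\star_P}](\gamma; \ell, P) \subseteq [\mathcal{H}_{\mathcal{U}', f^\star_P}](\gamma; \ell, P)$ when $f^\star_P \in [\mathcal{H}]$, so that

$\mathrm{D}_\ell([\mathcal{H}_{\mathcal{U}, f^\star_P}](\gamma; \ell, P); P) \le \mathrm{D}_\ell([\mathcal{H}_{\mathcal{U}', f^\star_P}](\gamma; \ell, P); P)$.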

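Finally, for $\mathcal{H} \subseteq [\mathcal{F}]$, distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, values $\gamma, \gamma_1, \gamma_2 \ge 0$, $s \in
[1, \infty)$, and any function $\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$, define

$\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \min\left\{ m \in \mathbb{N} : \mathring{U}_\ell(\mathcal{H}, \gamma_2; P, m, s) \le \gamma_1 \right\}$,

$\mathring{M}_\ell(\gamma; \mathcal{H}, P, \mathfrak{s}) = \sup_{\gamma' \ge \gamma} \mathring{M}_\ell(\gamma'/2, \gamma'; \mathcal{H}, P, \mathfrak{s}(\gamma, \gamma'))$.

For completeness, define $\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \infty$ when $\mathring{U}_\ell(\mathcal{H}, \gamma_2; P, m, s) > \gamma_1$
for every $m \in \mathbb{N}$.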

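It will often be convenient to isolate the terms in $\mathring{U}_\ell$ when inverting for a sufficient
$m$, thus arriving at an upper bound on $\mathring{M}_\ell$. Specifically, define

$\dot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) = \min\left\{ m \in \mathbb{N} : \mathrm{D}_\ell([\mathcal{H}](\gamma_2; \ell, P); P) \sqrt{\frac{s}{m}} + \frac{\bar{\ell} s}{m} \le \gamma_1 \right\}$,

$\ddot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P) = \min\left\{ m \in \mathbb{N} : \mathring{\phi}_\ell(\mathrm{D}_\ell([\mathcal{H}](\gamma_2; \ell, P); P), \mathcal{H}; m, P) \le \gamma_1 \right\}$.

This way, for $c = 1/(2K)$, we have

(8) $\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) \le \max\left\{ \ddot{M}_\ell(c\gamma_1, \gamma_2; \mathcal{H}, P), \dot{M}_\ell(c\gamma_1, \gamma_2; \mathcal{H}, P, s) \right\}$.

Also note that we clearly have

(9) $\dot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P, s) \le s \cdot \max\left\{ \frac{4 \mathrm{D}_\ell([\mathcal{H}](\gamma_2; \ell, P); P)^2}{\gamma_1^2}, \frac{2\bar{\ell}}{\gamma_1} \right\}$,

so that, in the task of bounding $\mathring{M}_\ell$, we can simply focus on bounding $\ddot{M}_\ell$.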
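As a quick numeric sanity check of the bound in (9), the following Python sketch finds, by brute force, the smallest $m$ satisfying the defining inequality (diameter term times $\sqrt{s/m}$ plus $\bar{\ell} s/m$ at most $\gamma_1$) and compares it with the closed-form expression. The function names and the sample values of `D`, `s`, `lbar`, and `gamma1` are illustrative assumptions, not taken from the text, and the comparison allows for rounding the closed form up to an integer.

```python
import math

def m_dot(gamma1, D, s, lbar):
    """Smallest integer m with D*sqrt(s/m) + lbar*s/m <= gamma1 (brute force)."""
    m = 1
    while D * math.sqrt(s / m) + lbar * s / m > gamma1:
        m += 1
    return m

def m_dot_bound(gamma1, D, s, lbar):
    """Closed-form upper bound of the form s * max{4 D^2 / gamma1^2, 2 lbar / gamma1}."""
    return s * max(4.0 * D * D / gamma1 ** 2, 2.0 * lbar / gamma1)

# Hypothetical values: D plays the role of the diameter term, lbar of the loss bound.
exact = m_dot(gamma1=0.1, D=1.0, s=2.0, lbar=1.0)
bound = m_dot_bound(gamma1=0.1, D=1.0, s=2.0, lbar=1.0)
print(exact, bound)
```

Each of the two terms is at most $\gamma_1/2$ once $m$ reaches the corresponding entry of the maximum, which is why the brute-force value never exceeds the closed form (up to integer rounding).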

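We will express our main abstract results below in terms of the incremental
values $\mathring{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, \mathcal{P}_{XY}, s)$; the quantity $\mathring{M}_\ell(\gamma; \mathcal{H}, \mathcal{P}_{XY}, \mathfrak{s})$ will also be useful
in deriving analogous results for $\mathrm{ERM}_\ell$. When $f^\star_P \in \mathcal{H}$, (6) implies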

(10) M e ( r ,u,p,s) < M e ( r ,n,p,$) < M e ( r ,n,p,s). 



3. Methods Based on Optimizing the Surrogate Risk. Perhaps the simplest
way to make use of a surrogate loss function is to try to optimize $\mathrm{R}_\ell(h)$ over $h \in \mathcal{F}$,
until identifying $h \in \mathcal{F}$ with $\mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^\star) < \Gamma_\ell(\varepsilon)$, at which point we are
guaranteed $\mathrm{er}(h) - \mathrm{er}(f^\star) \le \varepsilon$. In this section, we briefly discuss some known
results for this basic idea, along with a comment on the potential drawbacks of this
approach for active learning.

3.1. Passive Learning: Empirical Risk Minimization. In the context of passive
learning, the method of empirical $\ell$-risk minimization is one of the most-studied
methods for optimizing $\mathrm{R}_\ell(h)$ over $h \in \mathcal{F}$. Based on Lemma 4 and the above
definitions, one can derive a bound on the number of labeled data points $m$ sufficient
for $\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ to achieve a given excess error rate. Specifically, the following
theorem is due to Koltchinskii [23] (slightly modified here, following Giné and
Koltchinskii [15], to allow for general $\mathfrak{s}$ functions). It will serve as our baseline for
comparison in the applications below.

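THEOREM 6. Fix any function $\mathfrak{s} : (0, \infty)^2 \to [1, \infty)$. If $f^\star \in \mathcal{F}$, then for any
$m \ge \mathrm{M}_\ell(\Gamma_\ell(\varepsilon); \mathcal{F}, \mathcal{P}_{XY}, \mathfrak{s})$, with probability at least $1 - \sum_{j \in \mathbb{Z}_{\Gamma_\ell(\varepsilon)}} 6 e^{-\mathfrak{s}(\Gamma_\ell(\varepsilon), 2^j)}$,
$\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ produces a function $\hat{h}$ such that $\mathrm{er}(\hat{h}) - \mathrm{er}(f^\star) \le \varepsilon$. $\diamond$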

3.2. Negative Results for Active Learning. As mentioned, there are several
active learning methods designed to optimize a general loss function [8, 25]. However,
it turns out that for many interesting loss functions, the number of labels
required for active learning to achieve a given excess surrogate risk value is not
significantly smaller than that sufficient for passive learning by $\mathrm{ERM}_\ell$.

Specifically, consider a problem with $\mathcal{X} = \{x_0, x_1\}$, let $z \in (0, 1/2)$ be a constant,
and for $\varepsilon \in (0, z)$, let $\mathcal{P}(\{x_1\}) = \varepsilon/(2z)$, $\mathcal{P}(\{x_0\}) = 1 - \mathcal{P}(\{x_1\})$, and
suppose $\mathcal{F}$ and $\ell$ are such that for $\eta(x_1) = 1/2 + z$ and any $\eta(x_0) \in [4/6, 5/6]$,
we have $f^\star \in \mathcal{F}$. For this problem, any function $h$ with $\mathrm{sign}(h(x_1)) \ne +1$
has $\mathrm{er}(h) - \mathrm{er}(f^\star) \ge \varepsilon$, so that $\Gamma_\ell(\varepsilon) \le (\varepsilon/(2z)) (\ell^\star_-(\eta(x_1)) - \ell^\star(\eta(x_1)))$;
when $\ell$ is classification-calibrated and $\bar{\ell} < \infty$, this is $c\varepsilon$, for some $\ell$-dependent
$c \in (0, \infty)$. Any function $h$ with $\mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^\star) \le c\varepsilon$ for this problem must have
$\mathrm{R}_\ell(h; \mathcal{P}_{\{x_0\}}) - \mathrm{R}_\ell(f^\star; \mathcal{P}_{\{x_0\}}) \le c\varepsilon / \mathcal{P}(\{x_0\}) = O(\varepsilon)$. Existing results of Hanneke
and Yang [21] (with a slight modification to rescale for $\eta(x_0) \in [4/6, 5/6]$)
imply that, for many classification-calibrated losses $\ell$, the minimax optimal number
of labels sufficient for an active learning algorithm to achieve this is $\Theta(1/\varepsilon)$.
Hanneke and Yang [21] specifically show this for losses $\ell$ that are strictly positive,
decreasing, strictly convex, and twice differentiable with continuous second
derivative; however, that result can easily be extended to a wide variety of other
classification-calibrated losses, such as the quadratic loss, which satisfy these
conditions in a neighborhood of 0. It is also known [6] (see also below) that for many
such losses (specifically, those satisfying Condition 3 with $r_\ell = 2$), $O(1/\varepsilon)$ random
labeled samples are sufficient for $\mathrm{ERM}_\ell$ to achieve this same guarantee, so
that results that only bound the surrogate risk of the function produced by an active
learning method in this scenario can be at most a constant factor smaller than those
provable for passive learning methods.
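To make the two-point construction concrete, the following Python sketch evaluates it for one specific choice, the quadratic loss $\ell(z) = (1 - z)^2$, using a simple grid search over the value $h(x_1)$. The choice of loss, the grid, and all names and constants (`z_const`, `eps`, `surrogate_gap`) are assumptions made for illustration only. It checks that predicting the wrong sign at $x_1$ costs exactly $\varepsilon$ in error rate, while the corresponding upper bound on the surrogate threshold is only a constant multiple of $\varepsilon$.

```python
def cond_risk(z, eta, loss):
    """Conditional surrogate risk at a point with P(Y=+1 | x) = eta when h(x) = z."""
    return eta * loss(z) + (1.0 - eta) * loss(-z)

def surrogate_gap(eta, loss, grid):
    """Grid approximation of (best risk with the wrong sign) - (best risk overall)."""
    best = min(cond_risk(z, eta, loss) for z in grid)
    wrong = min(cond_risk(z, eta, loss) for z in grid if z * (2 * eta - 1) <= 0)
    return wrong - best

quadratic = lambda z: (1.0 - z) ** 2          # illustrative loss choice
grid = [i / 1000.0 for i in range(-1000, 1001)]

z_const, eps = 0.2, 0.01                      # hypothetical constants, eps < z_const
p1 = eps / (2 * z_const)                      # probability mass of the point x_1
eta1 = 0.5 + z_const                          # eta(x_1)

excess_error_wrong_sign = p1 * abs(2 * eta1 - 1)
gamma_upper = p1 * surrogate_gap(eta1, quadratic, grid)
print(excess_error_wrong_sign, gamma_upper)
```

For the quadratic loss the gap at $x_1$ works out to $(2\eta(x_1) - 1)^2 = 4z^2$, so the bound is $2z\varepsilon$, linear in $\varepsilon$ as the argument requires.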

In the next section, we provide an active learning algorithm and a general
analysis of its performance which, in the special case described above, guarantees
excess error rate less than $\varepsilon$ with high probability, using a number of label requests
$O(\log(1/\varepsilon) \log\log(1/\varepsilon))$. The implication is that, to identify the improvements
achievable by active learning with a surrogate loss, it is not sufficient to merely
analyze the surrogate risk of the function produced by a given active learning
algorithm. Indeed, since we are not particularly interested in the surrogate risk itself, we
may even consider active learning algorithms that do not actually optimize $\mathrm{R}_\ell(h)$
over $h \in \mathcal{F}$ (even in the limit).

4. Alternative Use of the Surrogate Loss. Given that we are interested in $\ell$
only insofar as it helps us to optimize the error rate with computational efficiency,
we should ask whether there is a method that sometimes makes more effective use
of $\ell$ in terms of optimizing the error rate, while maintaining essentially the same
computational advantages. The following method is essentially a relaxation of the
methods of Koltchinskii [25] and Hanneke [20]. Similar results should also hold for
analogous relaxations of the related methods of Balcan, Beygelzimer, and Langford
[3], Dasgupta, Hsu, and Monteleoni [11], Balcan, Beygelzimer, and Langford [4],
and Beygelzimer, Dasgupta, and Langford [8].

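Algorithm 1:

Input: surrogate loss $\ell$, unlabeled sample budget $u$, labeled sample budget $n$
Output: classifier $\hat{h}$

0. $V \leftarrow \mathcal{F}$, $Q \leftarrow \{\}$, $m \leftarrow 0$, $t \leftarrow 0$, $k \leftarrow 1$, $m_1 \leftarrow 0$, $\hat{\gamma}_1 \leftarrow \bar{\ell}$
1. While $m < u$ and $t < n$
2.     $m \leftarrow m + 1$
3.     If $X_m \in \mathrm{DIS}(V)$
4.         Request label $Y_m$ and let $Q \leftarrow Q \cup \{(m, Y_m)\}$, $t \leftarrow t + 1$
5.     If $\log_2(m - m_k) \in \mathbb{N}$ and $\hat{T}_\ell(V; Q, m, k) \frac{|Q| \vee 1}{m - m_k} \le \hat{\gamma}_k / 2$
6.         $\hat{\gamma}_{k+1} \leftarrow \hat{T}_\ell(V; Q, m, k) \frac{|Q| \vee 1}{m - m_k}$, $m_{k+1} \leftarrow m$
7.         $V \leftarrow \left\{ h \in V : \mathrm{R}_\ell(h; Q) - \inf_{g \in V} \mathrm{R}_\ell(g; Q) \le \hat{T}_\ell(V; Q, m, k) \right\}$
8.         $Q \leftarrow \{\}$, $k \leftarrow k + 1$
9. Return $\hat{h} = \mathrm{argmin}_{h \in V} \mathrm{R}_\ell(h; Q)$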



The intuition behind this algorithm is that, since we are only interested in achieving
low error rate, once we have identified $\mathrm{sign}(f^\star(x))$ for a given $x \in \mathcal{X}$, there
is no need to further optimize the value $\mathbb{E}[\ell(h(X)Y) | X = x]$. Thus, as long as
we maintain $f^\star \in V$, the data points $X_m \notin \mathrm{DIS}(V)$ are typically less informative
than those $X_m \in \mathrm{DIS}(V)$. We therefore focus the label requests on those
$X_m \in \mathrm{DIS}(V)$, since there remains some uncertainty about $\mathrm{sign}(f^\star(X_m))$ for
these points. The algorithm updates $V$ once enough samples have accumulated to
estimate the excess risks under the current sampling distribution up to some desired
precision. This update (Step 7) essentially removes from $V$ those functions
$h$ whose excess empirical risks (under the current sampling distribution) are relatively
large; by setting this threshold $\hat{T}_\ell$ appropriately, we can guarantee the excess
empirical risk of $f^\star$ is smaller than $\hat{T}_\ell$. Thus, the algorithm maintains $f^\star \in V$ as
an invariant, while focusing the sampling region $\mathrm{DIS}(V)$.
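The control flow described above can be sketched in Python for a finite function class. This is only a schematic illustration under strong assumptions: the class is a finite list of callables, the pool `xs` and the labeling `oracle` are hypothetical inputs, and the threshold `toy_threshold` is a crude finite-class deviation bound used as a stand-in, not the data-dependent quantity the paper uses for its analysis. The scaling by the fraction of recent samples that fell in the region of disagreement follows one reading of Steps 5 and 6.

```python
import math
import random

def sign(v):
    return 1 if v >= 0 else -1

def in_dis(V, x):
    """True if two functions in V predict different signs at x."""
    return len({sign(h(x)) for h in V}) > 1

def emp_risk(h, Q, xs, loss):
    """Average surrogate loss of h over the labeled pairs (index, label) in Q."""
    if not Q:
        return 0.0
    return sum(loss(h(xs[i]) * y) for i, y in Q) / len(Q)

def toy_threshold(V, Q, lbar, delta=0.05):
    """ASSUMPTION: simple stand-in threshold, not the paper's definition."""
    q = max(len(Q), 1)
    return lbar * math.sqrt(2.0 * math.log(2.0 * len(V) / delta) / q)

def algorithm1_sketch(F, xs, oracle, loss, u, n, lbar):
    V, Q = list(F), []
    m, t, k = 0, 0, 1
    m_k, gamma_k = 0, lbar
    while m < u and t < n:
        m += 1
        if in_dis(V, xs[m - 1]):                 # Step 3: query only inside DIS(V)
            Q.append((m - 1, oracle(m - 1)))
            t += 1
        gap = m - m_k
        if gap >= 2 and gap & (gap - 1) == 0:    # Step 5: gap is a power of two
            T = toy_threshold(V, Q, lbar)
            scaled = T * max(len(Q), 1) / gap    # rescale by fraction landing in DIS(V)
            if scaled <= gamma_k / 2:
                risks = [emp_risk(h, Q, xs, loss) for h in V]
                best = min(risks)
                V = [h for h, r in zip(V, risks) if r - best <= T]   # Step 7
                gamma_k, m_k = scaled, m                             # Step 6
                Q, k = [], k + 1                                     # Step 8
    return min(V, key=lambda h: emp_risk(h, Q, xs, loss))            # Step 9
```

Because each pass through the loop requests at most one label and the loop stops once `t` reaches `n`, the label budget is respected regardless of which threshold is plugged in; the quality of the returned classifier, in contrast, depends entirely on the threshold.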

In practice, the set $V$ can be maintained implicitly, simply by keeping track of
the constraints (Step 7) that define it; then the condition in Step 3 can be checked
by solving two constraint satisfaction problems (one for each sign); likewise, the
value $\inf_{g \in V} \mathrm{R}_\ell(g; Q)$ in these constraints, as well as the final $\hat{h}$, can be found by
solving constrained optimization problems. The quantity $\hat{T}_\ell$ in Algorithm 1 can be
defined in one of several possible ways. In our context, we consider the following
definition. Let $\{\xi_k\}_{k \in \mathbb{N}}$ denote independent Rademacher random variables (i.e.,
uniform in $\{-1, +1\}$), also independent from $Z$; these should be considered internal
random bits used by the algorithm, which is therefore a randomized algorithm.
For any $q \in \mathbb{N} \cup \{0\}$ and $Q = \{(i_1, y_1), \ldots, (i_q, y_q)\} \in (\mathbb{N} \times \{-1, +1\})^q$, let
$S[Q] = \{(X_{i_1}, y_1), \ldots, (X_{i_q}, y_q)\}$, $\Xi[Q] = \{\xi_{i_k}\}_{k=1}^{q}$. For $s \in [1, \infty)$, define

$\hat{U}_\ell(\mathcal{H}; Q, s) = \hat{U}_\ell(\mathcal{H}; S[Q], \Xi[Q], s)$.

Then we can define the quantity $\hat{T}_\ell$ in the method above as

(11) $\hat{T}_\ell(\mathcal{H}; Q, m, k) = \hat{U}_\ell(\mathcal{H}; Q, \hat{\mathfrak{s}}(\hat{\gamma}_k, m - m_k))$,

for some $\hat{\mathfrak{s}} : (0, \infty) \times \mathbb{N} \to [1, \infty)$. This definition has the appealing property
that it allows us to interpret the update in Step 7 in two complementary ways: as
comparing the empirical risks of functions in $V$ under the conditional distribution
given the region of disagreement $\mathcal{P}_{\mathrm{DIS}(V)}$, and as comparing the empirical risks of
the functions in $V_{\mathrm{DIS}(V)}$ under the original distribution $\mathcal{P}_{XY}$.

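For convenience, we will also suppose the function $\hat{\mathfrak{s}}$ in (11) satisfies, $\forall \gamma > 0$
and $m \in \mathbb{N}$,

(12) $\hat{\mathfrak{s}}(\gamma, m) = \hat{\mathfrak{s}}(2^{\lceil \log_2(\gamma) \rceil}, m)$,

so that we can effectively round $\gamma$ to a power of 2.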



We have the following theorem, which represents our main abstract result. The 
proof is included in Appendix A. 

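THEOREM 7. For each $j \ge -\lceil \log_2(\bar{\ell}) \rceil$, let $\hat{\mathfrak{s}}_j(\cdot) = \hat{\mathfrak{s}}(2^{-j}, \cdot)$, for $\hat{\mathfrak{s}}$ satisfying
(12), let $\mathcal{F}_j = \mathcal{F}(\mathcal{E}_\ell(2^{1-j}); {\scriptstyle 01})_{\mathrm{DIS}(\mathcal{F}(\mathcal{E}_\ell(2^{1-j}); {\scriptstyle 01}))}$, $\mathcal{U}_j = \mathrm{DIS}(\mathcal{F}_j)$, and let $u_j \in \mathbb{N}$
satisfy $\log_2(u_j) \in \mathbb{N}$ and

(13) $u_j \ge \mathring{M}_\ell(2^{-j-2}, 2^{1-j}; \mathcal{F}_j, \mathcal{P}_{XY}, \hat{\mathfrak{s}}_j(u_j))$.

Suppose $f^\star \in \mathcal{F}$. For any $\varepsilon \in (0, 1)$ and $s \in [1, \infty)$, if

$u \ge \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} u_j$ and $n \ge s + 2e \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} \mathcal{P}(\mathcal{U}_j) u_j$,

then, with arguments $\ell$, $u$, and $n$, Algorithm 1 uses at most $u$ unlabeled samples
and makes at most $n$ label requests, and with probability at least

$1 - 2^{-s} - \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} \sum_{i=1}^{\log_2(u_j)} 6 e^{-\hat{\mathfrak{s}}_j(2^i)}$,

returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^\star) \le \varepsilon$. $\diamond$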

The number of label requests indicated by Theorem 7 can often (though not
always) be significantly smaller than the number of random labeled data points
sufficient for $\mathrm{ERM}_\ell$ to achieve the same, as indicated by Theorem 6. This is typically
the case when $\mathcal{P}(\mathcal{U}_j) \to 0$ as $j \to \infty$. When this is the case, the number of
labels requested by the algorithm is sublinear in the number of unlabeled samples
it processes; below, we will derive more explicit results for certain types of function
classes $\mathcal{F}$, by characterizing the rate at which $\mathcal{P}(\mathcal{U}_j)$ vanishes in terms of a
complexity measure known as the disagreement coefficient.
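The way the two budgets in Theorem 7 are assembled from the per-round quantities can be illustrated with a short Python sketch. The per-round sample sizes `u_js` and disagreement probabilities `dis_probs` below are made-up numbers chosen only to show the arithmetic; in the theorem they are determined by the function class, the loss, and the distribution.

```python
import math

def budgets(u_js, dis_probs, s):
    """Unlabeled budget: sum of u_j.
    Label budget: s + 2e * sum of P(U_j) * u_j, with e Euler's number."""
    u = sum(u_js)
    n = s + 2.0 * math.e * sum(p * uj for p, uj in zip(dis_probs, u_js))
    return u, n

# Hypothetical example: the region of disagreement halves each round
# while the per-round sample size doubles.
u_total, n_total = budgets(u_js=[4, 8, 16], dis_probs=[1.0, 0.5, 0.25], s=3.0)
print(u_total, n_total)
```

In this toy setting every round contributes the same expected number of label requests even though the unlabeled sample sizes grow geometrically, which is the sublinear behavior the paragraph above refers to.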

For the purpose of calculating the values in Theorem 7, it is sometimes
convenient to use the alternative interpretation of Algorithm 1, in terms of sampling
$Q$ from the conditional distribution $\mathcal{P}_{\mathrm{DIS}(V)}$. Specifically, the following lemma
allows us to replace calculations in terms of $\mathcal{F}_j$ and $\mathcal{P}_{XY}$ with calculations in
terms of $\mathcal{F}(\mathcal{E}_\ell(2^{1-j}); {\scriptstyle 01})$ and $\mathcal{P}_{\mathrm{DIS}(\mathcal{F}_j)}$. Its proof is included in Appendix A.

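LEMMA 8. Let $\mathring{\phi}_\ell$ be any function satisfying Definition 5. Let $P$ be any distribution
over $\mathcal{X} \times \mathcal{Y}$. For any measurable $\mathcal{U} \subseteq \mathcal{X} \times \mathcal{Y}$ with $P(\mathcal{U}) > 0$, define
$P_{\mathcal{U}}(\cdot) = P(\cdot | \mathcal{U})$. Also, for any $\sigma \ge 0$, $\mathcal{H} \subseteq [\mathcal{F}]$, and $m \in \mathbb{N}$, if $P(\mathrm{DIS_F}(\mathcal{H})) > 0$,
define

(14) $\mathring{\phi}'_\ell(\sigma, \mathcal{H}; m, P) = 32 \left( \inf_{\mathcal{U} = \mathcal{U}' \times \mathcal{Y} : \mathcal{U}' \supseteq \mathrm{DIS_F}(\mathcal{H})} P(\mathcal{U}) \mathring{\phi}_\ell\left( \frac{\sigma}{\sqrt{P(\mathcal{U})}}, \mathcal{H}; \lceil (1/2) P(\mathcal{U}) m \rceil, P_{\mathcal{U}} \right) + \frac{\bar{\ell}}{m} + \sigma \sqrt{\frac{1}{m}} \right)$,

and otherwise define $\mathring{\phi}'_\ell(\sigma, \mathcal{H}; m, P) = 0$. Then the function $\mathring{\phi}'_\ell$ also satisfies
Definition 5. $\diamond$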

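Plugging this $\mathring{\phi}'_\ell$ function into Theorem 7 immediately yields the following
corollary, the proof of which is included in Appendix A.

COROLLARY 9. For each $j \ge -\lceil \log_2(\bar{\ell}) \rceil$, let $\mathcal{F}_j$, $\mathcal{U}_j$, and $\hat{\mathfrak{s}}_j$ be as in Theorem 7,
and if $\mathcal{P}(\mathcal{U}_j) > 0$, let $u_j \in \mathbb{N}$ satisfy $\log_2(u_j) \in \mathbb{N}$ and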

(15) ^ > 2P(^)" 1 M, l^L^Vu^i)) ■ 

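If $\mathcal{P}(\mathcal{U}_j) = 0$, let $u_j \in \mathbb{N}$ satisfy $\log_2(u_j) \in \mathbb{N}$ and $u_j \ge K \bar{\ell} \hat{\mathfrak{s}}_j(u_j) 2^{j+2}$. Suppose
$f^\star \in \mathcal{F}$. For any $\varepsilon \in (0, 1)$ and $s \in [1, \infty)$, if

$u \ge \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} u_j$ and $n \ge s + 2e \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} \mathcal{P}(\mathcal{U}_j) u_j$,

then, with arguments $\ell$, $u$, and $n$, Algorithm 1 uses at most $u$ unlabeled samples
and makes at most $n$ label requests, and with probability at least

$1 - 2^{-s} - \sum_{j = -\lceil \log_2(\bar{\ell}) \rceil}^{\lfloor \log_2(2/\Gamma_\ell(\varepsilon)) \rfloor} \sum_{i=1}^{\log_2(u_j)} 6 e^{-\hat{\mathfrak{s}}_j(2^i)}$,

returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^\star) \le \varepsilon$. $\diamond$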

Algorithm 1 can be modified in a variety of interesting ways, leading to related
methods that can be analyzed analogously. One simple modification is to use a
more involved bound to define the quantity $\hat{T}_\ell$. For instance, for $Q$ as above, and a
function $\hat{\mathfrak{s}}_k : (0, \infty) \times \mathbb{N} \to [1, \infty)$, one could define

f e (H; Q,m, k) = (3/2)?" 1 inf | A > : Vj G Z A , 

U e (U {Sq^- 1 ;?, S[Q}) ;Q,5 k (Sq- 1 ^ 1 , m-m k ))< 2^~ V* } , 



for which one can also prove a result similar to Lemma 4 [see 15, 23]. This definition
shares the convenient dual-interpretations property mentioned above about
$\hat{U}_\ell(\mathcal{H}; Q, \hat{\mathfrak{s}}(\hat{\gamma}_k, m - m_k))$; furthermore, the results above for Algorithm 1 also hold
under this definition (for appropriate $\hat{\mathfrak{s}}_k$ functions), with only minor modifications
to constants and event probabilities.

The update trigger in Step 5 can be modified in several ways, leading to interesting
related methods. One simple change would be replacing it with $\log_2(m) \in \mathbb{N}$,
as in the methods of Hanneke [20], which simplifies the algorithm to some extent.
In most applications of interest, this still yields a result similar to Theorem 7,
since we might expect the value $\mathring{M}_\ell(2^{-j-2}, 2^{1-j}; \mathcal{F}_j, \mathcal{P}_{XY}, \hat{\mathfrak{s}}_j(u_j))$ to be
at least twice as large as $\mathring{M}_\ell(2^{-j-1}, 2^{2-j}; \mathcal{F}_{j-1}, \mathcal{P}_{XY}, \hat{\mathfrak{s}}_{j-1}(u_{j-1}))$ anyway. Another
interesting possibility is to replace the last condition in Step 5 with a check
for $\hat{T}_\ell(V; Q, m, k) \frac{|Q| \vee 1}{m - m_k} < \Gamma_\ell(2^{-k})$. Of course, the value $\Gamma_\ell(2^{-k})$ is typically
not directly available to us, but we could substitute a distribution-independent
lower bound on $\Gamma_\ell(2^{-k})$, for instance based on the $\psi_\ell$ function of Bartlett, Jordan,
and McAuliffe [6]; in the active learning context, we could potentially use
unlabeled samples to estimate a $\mathcal{P}$-dependent lower bound on $\Gamma_\ell(2^{-k})$, or even
$\mathrm{diam}(V) \psi_\ell(2^{-k} / (2 \mathrm{diam}(V)))$, based on (3), where $\mathrm{diam}(V) = \sup_{h, g \in V} \Delta(h, g)$.

5. Applications. In this section, we apply the abstract results from above to a
few commonly-studied scenarios: namely, VC subgraph classes and entropy conditions,
with some additional mention of VC major classes and VC hull classes.
In the interest of making the results more concise and explicit, we express them
in terms of well-known conditions relating distances to excess risks. We also express
them in terms of a lower bound on $\Gamma_\ell(\varepsilon)$ of the type in (2), with convenient
properties that allow for closed-form expression of the results. To simplify the
presentation, we often omit numerical constant factors in the inequalities below, and
for this we use the common notation $f(x) \lesssim g(x)$ to mean that $f(x) \le c g(x)$ for
some implicit universal constant $c \in (0, \infty)$.

5.1. Diameter Conditions. To begin, we first state some general characterizations
relating distances to excess risks; these characterizations will make it easier
to express our results more concretely below, and make for a more straightforward
comparison between results for the above methods. The following condition, introduced
by Mammen and Tsybakov [27] and Tsybakov [33], is a well-known noise
condition, about which there is now an extensive literature [e.g., 6, 19, 20, 23].

CONDITION 10. For some $a \in [1, \infty)$ and $\alpha \in [0, 1]$, for every $g \in \mathcal{F}^*$,
$\Delta(g, f^\star) \le a \left( \mathrm{er}(g) - \mathrm{er}(f^\star) \right)^{\alpha}$.

$\diamond$



Condition 10 can be equivalently expressed in terms of certain noise conditions
[6, 27, 33]. Specifically, satisfying Condition 10 with some $\alpha < 1$ is equivalent to
the existence of some $a' \in [1, \infty)$ such that, for all $\varepsilon > 0$,

$\mathcal{P}\left( x : |\eta(x) - 1/2| \le \varepsilon \right) \le a' \varepsilon^{\alpha/(1-\alpha)}$,

which is often referred to as a low noise condition. Additionally, satisfying Condition 10
with $\alpha = 1$ is equivalent to having some $a' \in [1, \infty)$ such that

$\mathcal{P}\left( x : |\eta(x) - 1/2| < 1/a' \right) = 0$,

often referred to as a bounded noise condition.

For simplicity, we formulate our results in terms of $a$ and $\alpha$ from Condition 10.
However, for the abstract results in this section, the results remain valid under the
weaker condition that replaces $\mathcal{F}^*$ by $\mathcal{F}$, and adds the condition that $f^\star \in \mathcal{F}$. In
fact, the specific results in this section also remain valid using this weaker condition
while additionally using (3) in place of (2), as remarked above.

An analogous condition can be defined for the surrogate loss function, as follows.
Similar notions have been explored by Bartlett, Jordan, and McAuliffe [6]
and Koltchinskii [23].

CONDITION 11. For some $b \in [1, \infty)$ and $\beta \in [0, 1]$, for every $g \in [\mathcal{F}]$,
$\mathrm{D}_\ell(g, f^\star_P; P)^2 \le b \left( \mathrm{R}_\ell(g; P) - \mathrm{R}_\ell(f^\star_P; P) \right)^{\beta}$.

$\diamond$

Note that these conditions are always satisfied for some values of $a$, $b$, $\alpha$, $\beta$, since
$\alpha = \beta = 0$ trivially satisfies the conditions. However, in more benign scenarios,
values of $\alpha$ and $\beta$ strictly greater than 0 can be satisfied. Furthermore, for some
loss functions $\ell$, Condition 11 can even be satisfied universally, in the sense that
a value of $\beta > 0$ is satisfied for all distributions. In particular, Bartlett, Jordan,
and McAuliffe [6] show that this is the case under Condition 3, as stated in the
following lemma [see 6, for the proof].

LEMMA 12. Suppose Condition 3 is satisfied. Let $\beta = \min\{1, \frac{2}{r_\ell}\}$ and $b =
(2 C'_\ell)^{-\beta} L^2$, where $C'_\ell = C_\ell$ for $r_\ell \ge 2$, and $C'_\ell = C_\ell d_\ell^{r_\ell - 2}$ otherwise. Then every
distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ with $f^\star_P \in [\mathcal{F}]$ satisfies Condition 11 with these values
of $b$ and $\beta$. $\diamond$

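Under Condition 10, it is particularly straightforward to obtain bounds on $\Gamma_\ell(\varepsilon)$
based on a function $\psi_\ell(\varepsilon)$ satisfying (2). For instance, since $x \mapsto x \psi_\ell(1/x)$ is
nonincreasing on $(0, \infty)$ [6], the function

(16) $\Psi_\ell(\varepsilon) = a \varepsilon^{\alpha} \psi_\ell\left( \varepsilon^{1-\alpha} / (2a) \right)$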



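satisfies $\Psi_\ell(\varepsilon) \le \Gamma_\ell(\varepsilon)$ [6]. Furthermore, for classification-calibrated $\ell$, $\Psi_\ell$ in (16)
is strictly increasing, nonnegative, and continuous on $[0, 1]$ [6], and has $\Psi_\ell(0) = 0$;
thus, the inverse $\Psi_\ell^{-1}(\gamma)$, defined for all $\gamma > 0$ by

(17) $\Psi_\ell^{-1}(\gamma) = \inf\left( \{\varepsilon > 0 : \gamma \le \Psi_\ell(\varepsilon)\} \cup \{1\} \right)$,

is strictly increasing, nonnegative, and continuous on $(0, \Psi_\ell(1))$. Furthermore, one
can easily show $x \mapsto \Psi_\ell^{-1}(x)/x$ is nonincreasing on $(0, \infty)$. Also note that $\forall \gamma > 0$,
$\mathcal{E}_\ell(\gamma) \le \Psi_\ell^{-1}(\gamma)$.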
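The construction in (16) and its inverse in (17) can be evaluated numerically once a function $\psi_\ell$ is fixed. The Python sketch below does this for two illustrative choices, the identity and the square, which are the forms of $\psi_\ell$ commonly associated with the hinge and quadratic losses in [6]; the bisection routine, its tolerance, and all function names are assumptions made for this example and rely on $\Psi_\ell$ being increasing on $[0, 1]$.

```python
def make_Psi(psi, a, alpha):
    """Builds eps -> a * eps^alpha * psi(eps^(1 - alpha) / (2a)), as in (16)."""
    def Psi(eps):
        if eps == 0:
            return 0.0
        return a * eps ** alpha * psi(eps ** (1.0 - alpha) / (2.0 * a))
    return Psi

def Psi_inverse(Psi, gamma, tol=1e-12):
    """inf of {eps in (0, 1] : gamma <= Psi(eps)} united with {1}, by bisection.
    ASSUMPTION: Psi is nondecreasing on [0, 1]."""
    if gamma > Psi(1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if Psi(mid) >= gamma:
            hi = mid
        else:
            lo = mid
    return hi

identity_psi = lambda x: x        # illustrative: form associated with the hinge loss
square_psi = lambda x: x * x      # illustrative: form associated with the quadratic loss

Psi_hinge = make_Psi(identity_psi, a=2.0, alpha=0.5)
Psi_quad = make_Psi(square_psi, a=1.0, alpha=1.0)
print(Psi_hinge(0.3), Psi_inverse(Psi_hinge, 0.1), Psi_quad(0.4))
```

With the identity, the factors of $a$ and $\alpha$ cancel and the function reduces to $\varepsilon/2$ for every admissible $a$ and $\alpha$; with the square and $\alpha = 1$, $a = 1$, it reduces to $\varepsilon/4$.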

5.2. The Disagreement Coefficient. In order to more concisely state our results,
it will be convenient to bound $\mathcal{P}(\mathrm{DIS}(\mathcal{H}))$ by a linear function of $\mathrm{radius}(\mathcal{H})$,
for $\mathrm{radius}(\mathcal{H})$ in a given range. This type of relaxation has been used extensively
in the active learning literature [5, 8, 11, 14, 17-20, 25, 26, 32, 37], and the coefficient
in the linear function is typically referred to as the disagreement coefficient.
Specifically, the following definition is due to Hanneke [17, 19]; related quantities
have been explored by Alexander [1] and Giné and Koltchinskii [15].

DEFINITION 13. For any $r_0 > 0$, define the disagreement coefficient of a
function $h : \mathcal{X} \to \mathbb{R}$ with respect to $\mathcal{F}$ under $\mathcal{P}$ as

$\theta_h(r_0) = \sup_{r > r_0} \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h, r)))}{r} \vee 1$.

If $f^\star \in \mathcal{F}$, define the disagreement coefficient of the class $\mathcal{F}$ as $\theta(r_0) = \theta_{f^\star}(r_0)$.

$\diamond$

The value of $\theta(\varepsilon)$ has been studied and bounded for various function classes
$\mathcal{F}$ under various conditions on $\mathcal{P}$. In many cases of interest, $\theta(\varepsilon)$ is known to be
bounded by a finite constant [5, 14, 17, 19, 26], while in other cases, $\theta(\varepsilon)$ may have
an interesting dependence on $\varepsilon$ [5, 32, 37]. The reader is referred to the works of
Hanneke [19, 20] for detailed discussions on the disagreement coefficient.
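For intuition, the disagreement coefficient can be computed exactly for a small finite class under a discrete distribution. The Python sketch below does so for threshold functions on a uniform grid, treating the ball as closed and measuring distance by the probability of sign disagreement; the finite class, the grid, and the function name are assumptions for illustration and are not part of the paper's setting.

```python
def disagreement_coefficient(F, xs, f_star, r0):
    """Empirical theta_{f_star}(r0) for a finite class F under the uniform
    distribution on the points xs (an illustrative stand-in for P)."""
    n = len(xs)
    sgn = lambda v: 1 if v >= 0 else -1
    preds = [[sgn(h(x)) for x in xs] for h in F]
    star = [sgn(f_star(x)) for x in xs]
    # Distance to f_star: probability that the predicted signs differ.
    dist = [sum(p != s for p, s in zip(row, star)) / n for row in preds]
    # P(DIS(B(f_star, r))) is a step function of r, so the supremum over r > r0
    # is approached at r0 itself or attained at one of the jump points above it.
    radii = sorted({r0} | {d for d in dist if d > r0})
    theta = 1.0
    for r in radii:
        ball = [row for row, d in zip(preds, dist) if d <= r]
        dis = sum(1 for i in range(n) if len({row[i] for row in ball}) > 1) / n
        theta = max(theta, dis / r)
    return theta

# Thresholds x -> x - t on a uniform grid: the ball of radius r around the
# middle threshold disagrees on an interval of mass 2r.
N = 100
grid_points = [(i + 0.5) / N for i in range(N)]
thresholds = [(lambda x, t=j / N: x - t) for j in range(N + 1)]
theta = disagreement_coefficient(thresholds, grid_points, thresholds[50], 0.1)
print(theta)
```

For this class the region of disagreement of a ball grows linearly in the radius, so the coefficient stays bounded by a constant, which is the favorable regime referred to above.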

5.3. Specification of $\mathring{\phi}_\ell$. Next, we recall a few well-known bounds on the $\phi_\ell$
function, which leads to a more concrete instance of a function $\mathring{\phi}_\ell$ satisfying Definition 5.
Below, we let $\mathcal{G}^*$ denote the set of measurable functions $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$.
Also, for $\mathcal{G} \subseteq \mathcal{G}^*$, let $\mathrm{F}(\mathcal{G}) = \sup_{g \in \mathcal{G}} |g|$ denote the minimal envelope function for
$\mathcal{G}$, and for $g \in \mathcal{G}^*$ let $\|g\|_P^2 = \int g^2 \, dP$ denote the squared $L_2(P)$ seminorm of $g$;
we will generally assume $\mathrm{F}(\mathcal{G})$ is measurable in the discussion below.

Uniform Entropy: The first bound is based on the work of van der Vaart and Wellner
[34]; related bounds have been studied by Giné and Koltchinskii [15], Giné,
Koltchinskii, and Wellner [16], van der Vaart and Wellner [35], and others. For a
distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, a set $\mathcal{G} \subseteq \mathcal{G}^*$, and $\varepsilon \ge 0$, let $\mathcal{N}(\varepsilon, \mathcal{G}, L_2(P))$ denote the
size of a minimal $\varepsilon$-cover of $\mathcal{G}$ (that is, the minimum number of balls of radius at
most $\varepsilon$ sufficient to cover $\mathcal{G}$), where distances are measured in terms of the $L_2(P)$
pseudo-metric: $(f, g) \mapsto \|f - g\|_P$. For $\sigma \ge 0$ and $F \in \mathcal{G}^*$, define the function

$J(\sigma, \mathcal{G}, F) = \sup_{Q} \int_0^{\sigma} \sqrt{1 + \ln \mathcal{N}(\varepsilon \|F\|_Q, \mathcal{G}, L_2(Q))} \, d\varepsilon$,

where $Q$ ranges over all finitely discrete probability measures.

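Fix any distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ and any $\mathcal{H} \subseteq [\mathcal{F}]$ with $f^\star_P \in \mathcal{H}$, and let

(18) $\mathcal{G}_{\mathcal{H}} = \{ (x, y) \mapsto \ell(h(x) y) : h \in \mathcal{H} \}$, and $\mathcal{G}_{\mathcal{H}, P} = \{ (x, y) \mapsto \ell(h(x) y) - \ell(f^\star_P(x) y) : h \in \mathcal{H} \}$.

Then, since $J(\sigma, \mathcal{G}_{\mathcal{H}}, F) = J(\sigma, \mathcal{G}_{\mathcal{H}, P}, F)$, it follows from Theorem 2.1 of van der
Vaart and Wellner [34] (and a triangle inequality) that for some universal constant
$c \in [1, \infty)$, for any $m \in \mathbb{N}$, $F \ge \mathrm{F}(\mathcal{G}_{\mathcal{H}, P})$, and $\sigma \ge \mathrm{D}_\ell(\mathcal{H}; P)$,

(19) $\phi_\ell(\mathcal{H}; m, P) \le c J\left( \frac{\sigma}{\|F\|_P}, \mathcal{G}_{\mathcal{H}}, F \right) \|F\|_P \left( \frac{1}{\sqrt{m}} + \frac{J\left( \frac{\sigma}{\|F\|_P}, \mathcal{G}_{\mathcal{H}}, F \right) \|F\|_P \bar{\ell}}{\sigma^2 m} \right)$.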



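Based on (19), it is straightforward to define a function $\mathring{\phi}_\ell$ that satisfies Definition 5.
Specifically, define

(20) $\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P) = \inf_{F \ge \mathrm{F}(\mathcal{G}_{\mathcal{H}, P})} \inf_{\lambda \ge \sigma} c J\left( \frac{\lambda}{\|F\|_P}, \mathcal{G}_{\mathcal{H}}, F \right) \|F\|_P \left( \frac{1}{\sqrt{m}} + \frac{J\left( \frac{\lambda}{\|F\|_P}, \mathcal{G}_{\mathcal{H}}, F \right) \|F\|_P \bar{\ell}}{\lambda^2 m} \right)$,

for $c$ as in (19). By (19), $\mathring{\phi}^{(1)}_\ell$ satisfies (5). Also note that $m \mapsto \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is
nonincreasing, while $\sigma \mapsto \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is nondecreasing. Furthermore, $\mathcal{H} \mapsto
\mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}}, L_2(Q))$ is nondecreasing for all $Q$, so that $\mathcal{H} \mapsto J(\sigma, \mathcal{G}_{\mathcal{H}}, F)$ is nondecreasing
as well; since $\mathcal{H} \mapsto \mathrm{F}(\mathcal{G}_{\mathcal{H}, P})$ is also nondecreasing, we see that $\mathcal{H} \mapsto
\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ is nondecreasing. Similarly, for $\mathcal{U} \subseteq \mathcal{X}$, $\mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}_{\mathcal{U}, f^\star_P}}, L_2(Q))
\le \mathcal{N}(\varepsilon, \mathcal{G}_{\mathcal{H}}, L_2(Q))$ for all $Q$, so that $J(\sigma, \mathcal{G}_{\mathcal{H}_{\mathcal{U}, f^\star_P}}, F) \le J(\sigma, \mathcal{G}_{\mathcal{H}}, F)$; because
$\mathrm{F}(\mathcal{G}_{\mathcal{H}_{\mathcal{U}, f^\star_P}, P}) \le \mathrm{F}(\mathcal{G}_{\mathcal{H}, P})$, we have $\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}_{\mathcal{U}, f^\star_P}; m, P) \le \mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P)$ as
well. Thus, to satisfy Definition 5, it suffices to take $\mathring{\phi}_\ell = \mathring{\phi}^{(1)}_\ell$.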



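Bracketing Entropy: Our second bound is a classic result in empirical process theory.
For functions $g_1 \le g_2$, a bracket $[g_1, g_2]$ is the set of functions $g \in \mathcal{G}^*$ with
$g_1 \le g \le g_2$; $[g_1, g_2]$ is called an $\varepsilon$-bracket under $L_2(P)$ if $\|g_1 - g_2\|_P < \varepsilon$.
Then $\mathcal{N}_{[]}(\varepsilon, \mathcal{G}, L_2(P))$ denotes the smallest number of $\varepsilon$-brackets (under $L_2(P)$)
sufficient to cover $\mathcal{G}$. For $\sigma \ge 0$, define the function

$J_{[]}(\sigma, \mathcal{G}, P) = \int_0^{\sigma} \sqrt{1 + \ln \mathcal{N}_{[]}(\varepsilon, \mathcal{G}, L_2(P))} \, d\varepsilon$.

Fix any $\mathcal{H} \subseteq [\mathcal{F}]$, and let $\mathcal{G}_{\mathcal{H}}$ and $\mathcal{G}_{\mathcal{H}, P}$ be as above. Then since $J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) =
J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}, P}, P)$, Lemma 3.4.2 of van der Vaart and Wellner [35] and a triangle
inequality imply that for some universal constant $c \in [1, \infty)$, for any $m \in \mathbb{N}$ and
$\sigma \ge \mathrm{D}_\ell(\mathcal{H}; P)$,

(21) $\phi_\ell(\mathcal{H}; m, P) \le c J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) \left( \frac{1}{\sqrt{m}} + \frac{J_{[]}(\sigma, \mathcal{G}_{\mathcal{H}}, P) \bar{\ell}}{\sigma^2 m} \right)$.

As-is, the right side of (21) nearly satisfies Definition 5 already. Only a slight modification
is required to fulfill the requirement of monotonicity in $\sigma$. Specifically,
define

(22) $\mathring{\phi}^{(2)}_\ell(\sigma, \mathcal{H}; m, P) = \inf_{\lambda \ge \sigma} c J_{[]}(\lambda, \mathcal{G}_{\mathcal{H}}, P) \left( \frac{1}{\sqrt{m}} + \frac{J_{[]}(\lambda, \mathcal{G}_{\mathcal{H}}, P) \bar{\ell}}{\lambda^2 m} \right)$,

for $c$ as in (21). Then taking $\mathring{\phi}_\ell = \mathring{\phi}^{(2)}_\ell$ suffices to satisfy Definition 5.

Since Definition 5 is satisfied for both $\mathring{\phi}^{(1)}_\ell$ and $\mathring{\phi}^{(2)}_\ell$, it is also satisfied for

(23) $\mathring{\phi}_\ell = \min\left\{ \mathring{\phi}^{(1)}_\ell, \mathring{\phi}^{(2)}_\ell \right\}$.

For the remainder of this section, we suppose $\mathring{\phi}_\ell$ is defined as in (23) (for all distributions
$P$ over $\mathcal{X} \times \mathcal{Y}$), and study the implications arising from the combination
of this definition with the abstract theorems above.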

5.4. VC Subgraph Classes. For a collection $\mathcal{A}$ of sets, a set $\{z_1, \ldots, z_k\}$ of
points is said to be shattered by $\mathcal{A}$ if $|\{A \cap \{z_1, \ldots, z_k\} : A \in \mathcal{A}\}| = 2^k$. The
VC dimension $\mathrm{vc}(\mathcal{A})$ of $\mathcal{A}$ is then defined as the largest integer $k$ for which there
exist $k$ points $\{z_1, \ldots, z_k\}$ shattered by $\mathcal{A}$ [36]; if no such largest $k$ exists, we
define $\mathrm{vc}(\mathcal{A}) = \infty$. For a set $\mathcal{G}$ of real-valued functions, denote by $\mathrm{vc}(\mathcal{G})$ the
VC dimension of the collection $\{\{(x, y) : y < g(x)\} : g \in \mathcal{G}\}$ of subgraphs of
functions in $\mathcal{G}$ (called the pseudo-dimension [22, 31]); to simplify the statement
of results below, we adopt the convention that when the VC dimension of this
collection is 0, we let $\mathrm{vc}(\mathcal{G}) = 1$. A set $\mathcal{G}$ is said to be a VC subgraph class if
$\mathrm{vc}(\mathcal{G}) < \infty$ [35].

Because we are interested in results concerning values of $\mathrm{R}_\ell(h) - \mathrm{R}_\ell(f^\star)$, for
functions $h$ in certain subsets $\mathcal{H} \subseteq [\mathcal{F}]$, we will formulate results below in terms
of $\mathrm{vc}(\mathcal{G}_{\mathcal{H}})$, for $\mathcal{G}_{\mathcal{H}}$ defined as above. Depending on certain properties of $\ell$, these
results can often be restated directly in terms of $\mathrm{vc}(\mathcal{H})$; for instance, this is true
when $\ell$ is monotone, since $\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \le \mathrm{vc}(\mathcal{H})$ in that case [12, 22, 29].
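The definitions of shattering and VC dimension can be checked by brute force on a small finite domain. The Python sketch below does this for the collection of integer intervals on a six-point domain; the domain, the collection, and the function names are illustrative assumptions, and the search is exponential, so it is only meant for tiny examples.

```python
from itertools import combinations

def is_shattered(points, sets):
    """True if every subset of `points` arises as (A intersected with points) for some A."""
    patterns = {frozenset(p for p in points if p in A) for A in sets}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, sets):
    """Largest k such that some k-point subset of `domain` is shattered by `sets`.
    Stops at the first k with no shattered subset, which is valid because every
    subset of a shattered set is itself shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(c, sets) for c in combinations(domain, k)):
            d = k
        else:
            break
    return d

domain = list(range(6))
# All intervals {a, ..., b-1} with 0 <= a <= b <= 6 (includes the empty set).
intervals = [frozenset(range(a, b)) for a in range(7) for b in range(a, 7)]
print(vc_dimension(domain, intervals))
```

Intervals can pick out any subset of two points but never the two endpoints of a triple without the middle point, so the search stops at two.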

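The following is a well-known result for VC subgraph classes [see e.g., 35],
derived from the works of Pollard [30] and Haussler [22].

LEMMA 14. For any $\mathcal{G} \subseteq \mathcal{G}^*$, for any measurable $F \ge \mathrm{F}(\mathcal{G})$, for any distribution
$Q$ such that $\|F\|_Q > 0$, for any $\varepsilon \in (0, 1)$,

$\mathcal{N}(\varepsilon \|F\|_Q, \mathcal{G}, L_2(Q)) \le A(\mathcal{G}) \left( \frac{1}{\varepsilon} \right)^{2 \mathrm{vc}(\mathcal{G})}$,

where $A(\mathcal{G}) \lesssim (\mathrm{vc}(\mathcal{G}) + 1)(16e)^{\mathrm{vc}(\mathcal{G})}$.

In particular, Lemma 14 implies that any $\mathcal{G} \subseteq \mathcal{G}^*$ has, $\forall \sigma \in (0, 1]$,

(24) $J(\sigma, \mathcal{G}, F) \le \int_0^{\sigma} \sqrt{\ln(e A(\mathcal{G})) + 2 \mathrm{vc}(\mathcal{G}) \ln(1/\varepsilon)} \, d\varepsilon$

$\le 2\sigma \sqrt{\ln(e A(\mathcal{G}))} + \int_0^{\sigma} \sqrt{8 \mathrm{vc}(\mathcal{G}) \ln(1/\varepsilon)} \, d\varepsilon$

$= 2\sigma \sqrt{\ln(e A(\mathcal{G}))} + \sigma \sqrt{8 \mathrm{vc}(\mathcal{G}) \ln(1/\sigma)} + \sqrt{2\pi \mathrm{vc}(\mathcal{G})} \, \mathrm{erfc}\left( \sqrt{\ln(1/\sigma)} \right)$.

Since $\mathrm{erfc}(x) \le \exp\{-x^2\}$ for all $x \ge 0$, (24) implies $\forall \sigma \in (0, 1]$,

(25) $J(\sigma, \mathcal{G}, F) \lesssim \sigma \sqrt{\mathrm{vc}(\mathcal{G}) \mathrm{Log}(1/\sigma)}$.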
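The last equality in (24) rests on a closed form for the integral of $\sqrt{\ln(1/\varepsilon)}$ over $(0, \sigma]$, namely $\sigma \sqrt{\ln(1/\sigma)} + (\sqrt{\pi}/2) \, \mathrm{erfc}(\sqrt{\ln(1/\sigma)})$. The Python sketch below checks this identity against a midpoint-rule approximation and also spot-checks the inequality $\mathrm{erfc}(x) \le e^{-x^2}$; the step count, tolerance, and function names are arbitrary choices for illustration.

```python
import math

def closed_form(sigma):
    """sigma*sqrt(ln(1/sigma)) + (sqrt(pi)/2)*erfc(sqrt(ln(1/sigma))), for sigma in (0, 1]."""
    u = math.sqrt(math.log(1.0 / sigma))
    return sigma * u + 0.5 * math.sqrt(math.pi) * math.erfc(u)

def midpoint_integral(sigma, steps=100000):
    """Midpoint-rule approximation of the integral of sqrt(ln(1/eps)) over (0, sigma]."""
    h = sigma / steps
    return h * sum(math.sqrt(math.log(1.0 / ((i + 0.5) * h))) for i in range(steps))

for sigma in (0.1, 0.5, 1.0):
    print(sigma, closed_form(sigma), midpoint_integral(sigma))
```

Multiplying the closed form by $\sqrt{8 \mathrm{vc}(\mathcal{G})}$ gives the two $\mathrm{vc}$-dependent terms in the last line of (24), since $\sqrt{8} \cdot \sqrt{\pi}/2 = \sqrt{2\pi}$.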

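Applying these observations to bound $J(\sigma, \mathcal{G}_{\mathcal{H}, P}, F)$ for $\mathcal{H} \subseteq [\mathcal{F}]$ and $F \ge
\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})$, noting $J(\sigma, \mathcal{G}_{\mathcal{H}}, F) = J(\sigma, \mathcal{G}_{\mathcal{H}, P}, F)$ and $\mathrm{vc}(\mathcal{G}_{\mathcal{H}, P}) = \mathrm{vc}(\mathcal{G}_{\mathcal{H}})$, and plugging
the resulting bound into (20) yields the following well-known bound on $\mathring{\phi}^{(1)}_\ell$
due to Giné and Koltchinskii [15]. For any $m \in \mathbb{N}$ and $\sigma > 0$,

(26) $\mathring{\phi}^{(1)}_\ell(\sigma, \mathcal{H}; m, P) \lesssim \inf_{\lambda \ge \sigma} \lambda \sqrt{\frac{\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})\|_P}{\lambda} \right)}{m}} + \frac{\mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \bar{\ell} \, \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})\|_P}{\lambda} \right)}{m}$.

Specifically, to arrive at (26), we relaxed the $\inf_{F \ge \mathrm{F}(\mathcal{G}_{\mathcal{H}, P})}$ in (20) by taking $F \ge
\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})$ such that $\|F\|_P = \max\{\sigma, \|\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})\|_P\}$, thus maintaining $\lambda / \|F\|_P \in
(0, 1]$ for the minimizing $\lambda$ value, so that (25) remains valid; we also made use of
the fact that $\mathrm{Log} \ge 1$, which gives us $\mathrm{Log}(\|F\|_P / \lambda) = \mathrm{Log}(\|\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})\|_P / \lambda)$ for
this case.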

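In particular, (26) implies

(27) $\ddot{M}_\ell(\gamma_1, \gamma_2; \mathcal{H}, P) \lesssim \inf_{\sigma \ge \mathrm{D}_\ell([\mathcal{H}](\gamma_2; \ell, P); P)} \left( \frac{\sigma^2}{\gamma_1^2} + \frac{\bar{\ell}}{\gamma_1} \right) \mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \frac{\|\mathrm{F}(\mathcal{G}_{\mathcal{H}, P})\|_P}{\sigma} \right)$.

Following Giné and Koltchinskii [15], for $r > 0$, define $\mathrm{B}_{\mathcal{H}, P}(f^\star_P, r; \ell) = \{ g \in
\mathcal{H} : \mathrm{D}_\ell(g, f^\star_P; P)^2 \le r \}$, and for $r_0 \ge 0$, define

$\tau_\ell(r_0; \mathcal{H}, P) = \sup_{r > r_0} \frac{\left\| \mathrm{F}\left( \mathcal{G}_{\mathrm{B}_{\mathcal{H}, P}(f^\star_P, r; \ell), P} \right) \right\|_P^2}{r} \vee 1$.

When $P = \mathcal{P}_{XY}$, abbreviate this as $\tau_\ell(r_0; \mathcal{H}) = \tau_\ell(r_0; \mathcal{H}, \mathcal{P}_{XY})$, and when $\mathcal{H} =
\mathcal{F}$, further abbreviate $\tau_\ell(r_0) = \tau_\ell(r_0; \mathcal{F}, \mathcal{P}_{XY})$. For $\lambda > 0$, when $f^\star_P \in \mathcal{H}$ and $P$
satisfies Condition 11, (27) implies that,

(28) $\sup_{\gamma \ge \lambda} \ddot{M}_\ell(\gamma/(4K), \gamma; \mathcal{H}(\gamma; \ell, P), P) \lesssim \left( \frac{b}{\lambda^{2-\beta}} + \frac{\bar{\ell}}{\lambda} \right) \mathrm{vc}(\mathcal{G}_{\mathcal{H}}) \mathrm{Log}\left( \tau_\ell(b \lambda^{\beta}; \mathcal{H}, P) \right)$.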

Combining this observation with (6), (8), (9), (10), and Theorem 6, we arrive at
a result for the sample complexity of empirical $\ell$-risk minimization with a general
VC subgraph class under Conditions 10 and 11. Specifically, for $\mathfrak{s} : (0, \infty)^2 \to
[1, \infty)$, when $f^\star \in \mathcal{F}$, (6) implies that

M e {r e (e);T,VxY,s) < M t (T e (e);F,VxY,s) 

= sup M^( 7 /2,7;-P(7^),^xy,s(r £ (e),7)) 

7 >r £ (e) 

(29) < sup M^7/2,7;P(7^),Pxy,s(rHe),7))- 

7 >r,(e) 

Supposing Vxy satisfies Conditions 10 and 11, applying (8), (9), and (28) to (29), 
and taking s(A, 7 ) = Log we arrive at the following theorem, which is 

implicit in the work of Gine and Koltchinskii [15]. 



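THEOREM 15. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10
and Condition 11, $\ell$ is classification-calibrated, $f^\star \in \mathcal{F}$, and $\Psi_\ell$ is as in
(16), then for any $\varepsilon \in (0, 1)$, letting $\tau_\ell = \tau_\ell\left( b \Psi_\ell(\varepsilon)^{\beta} \right)$, for any $m \in \mathbb{N}$ with

(30) $m \ge c \left( \frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)} \right) \left( \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}(\tau_\ell) + \mathrm{Log}(1/\delta) \right)$,

with probability at least $1 - \delta$, $\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ produces $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^\star) \le \varepsilon$.

$\diamond$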

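As noted by Giné and Koltchinskii [15], in the special case when $\ell$ is itself
the 0-1 loss, the bound in Theorem 15 simplifies quite nicely, since in that case
$\left\| \mathrm{F}\left( \mathcal{G}_{\mathrm{B}_{\mathcal{F}, \mathcal{P}_{XY}}(f^\star, r; \ell), \mathcal{P}_{XY}} \right) \right\|^2_{\mathcal{P}_{XY}} = \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f^\star, r)))$, so that $\tau_\ell(r_0) = \theta(r_0)$; in
this case, we also have $\mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \le \mathrm{vc}(\mathcal{F})$ and $\Psi_\ell(\varepsilon) = \varepsilon/2$, and we can take $\beta = \alpha$
and $b = a$, so that it suffices to have

(31) $m \ge c a \varepsilon^{\alpha - 2} \left( \mathrm{vc}(\mathcal{F}) \mathrm{Log}(\theta) + \mathrm{Log}(1/\delta) \right)$,

where $\theta = \theta(a \varepsilon^{\alpha})$ and $c \in [1, \infty)$ is a universal constant. It is known that this is
sometimes the minimax optimal number of samples sufficient for passive learning
[9, 19, 32].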

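Next, we turn to the performance of Algorithm 1 under the conditions of Theorem 15.
Specifically, suppose $\mathcal{P}_{XY}$ satisfies Conditions 10 and 11, and for $\gamma_0 \ge 0$,
define

$\chi_\ell(\gamma_0) = \sup_{\gamma > \gamma_0} \frac{\mathcal{P}\left( \mathrm{DIS}\left( \mathrm{B}\left( f^\star, a \mathcal{E}_\ell(\gamma)^{\alpha} \right) \right) \right)}{b \gamma^{\beta}} \vee 1$.

Note that $\left\| \mathrm{F}(\mathcal{G}_{\mathcal{F}_j, \mathcal{P}_{XY}}) \right\|^2_{\mathcal{P}_{XY}} \le \bar{\ell}^2 \mathcal{P}\left( \mathrm{DIS}\left( \mathcal{F}(\mathcal{E}_\ell(2^{1-j}); {\scriptstyle 01}) \right) \right)$. Thus, by (27), for
$-\lceil \log_2(\bar{\ell}) \rceil \le j \le \lfloor \log_2(2/\Psi_\ell(\varepsilon)) \rfloor$,

(32) $\ddot{M}_\ell(2^{-j-3} K^{-1}, 2^{1-j}; \mathcal{F}_j, \mathcal{P}_{XY}) \lesssim \left( b 2^{j(2-\beta)} + \bar{\ell} 2^{j} \right) \mathrm{vc}(\mathcal{G}_{\mathcal{F}}) \mathrm{Log}\left( \chi_\ell(\Psi_\ell(\varepsilon)) \bar{\ell} \right)$.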

With a little additional work to define an appropriate $\hat{\mathfrak{s}}_j$ function and derive
closed-form bounds on the summations in Theorem 7, we arrive at the following
theorem regarding the performance of Algorithm 1 for VC subgraph classes.
For completeness, the remaining technical details of the proof are included in
Appendix A.

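THEOREM 16. For a universal constant $c \in [1, \infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10
and Condition 11, $\ell$ is classification-calibrated, $f^\star \in \mathcal{F}$, and $\Psi_\ell$ is
as in (16), for any $\varepsilon \in (0, 1)$, letting $\theta = \theta(a \varepsilon^{\alpha})$, $\chi_\ell = \chi_\ell(\Psi_\ell(\varepsilon))$, $A_1 =$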






vc(&r)Log(x$ + Log(l/<$), Bi = min { , Log(l/^(e))}, and C x 

I ^ TT ,Log(Z/^(e))}-'/ 



mm 



( 33 ) u ^ c ( iTr A « + 77TTT ) ^1 



and 



(34 ) „ > ete- f"- 4 ' + + 3* + T L ° g f ' ))Cl 

then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{\mathfrak{s}}$ function satisfying (12),
Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests,
and with probability at least $1 - \delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^\star) \le \varepsilon$.

$\diamond$



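To be clear, in specifying $B_1$ and $C_1$, we have adopted the convention that $1/0 = \infty$ and $\min\{\infty, x\} = x$ for any $x \in \mathbb{R}$, so that $B_1$ and $C_1$ are well-defined even when $\alpha = \beta = 1$, or $\alpha = 1$, respectively. Note that, when $\alpha + \beta < 2$, $B_1 = O(1)$, so that the asymptotic dependence on $\varepsilon$ in (34) is $O\left(\theta\varepsilon^{\alpha}\Psi_\ell(\varepsilon)^{\beta-2}\mathrm{Log}(\chi_\ell)\right)$, while in the case of $\alpha = \beta = 1$, it is $O\left(\theta\,\mathrm{Log}(1/\varepsilon)(\mathrm{Log}(\theta) + \mathrm{Log}(\mathrm{Log}(1/\varepsilon)))\right)$. It is likely that the logarithmic and constant factors can be improved in many cases (particularly the $\mathrm{Log}(\chi_\ell\bar{\ell})$, $B_1$, and $C_1$ factors).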
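The role of the convention $1/0 = \infty$ can be illustrated numerically. The sketch below is only an illustration, not part of the analysis: it checks that a finite geometric sum $\sum_{i=0}^{N} 2^{i e}$ with exponent $e \le 0$ is bounded both by its closed form $1/(1-2^{e})$ and by its number of terms $N+1$. The function names are hypothetical, and the term count $N+1$ stands in for the logarithmic term appearing in $B_1$ and $C_1$.

```python
import math

def geometric_cap(exponent, n_terms):
    """min{1/(1 - 2**exponent), n_terms}, using 1/0 = inf and min{inf, x} = x."""
    denom = 1.0 - 2.0 ** exponent
    closed_form = math.inf if denom <= 0.0 else 1.0 / denom
    return min(closed_form, n_terms)

def partial_sum(exponent, N):
    """sum_{i=0}^{N} 2**(i * exponent): the kind of sum that B_1 and C_1 control."""
    return sum(2.0 ** (i * exponent) for i in range(N + 1))

N = 12
for alpha, beta in [(0.5, 0.5), (1.0, 0.5), (1.0, 1.0)]:
    e = alpha + beta - 2.0  # exponent governing the B_1-type sum
    print(alpha, beta, partial_sum(e, N), geometric_cap(e, N + 1))
```

When $\alpha = \beta = 1$ the exponent is zero, the closed form is infinite, and the cap falls back to the number of terms, which is the situation the convention is designed to handle.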

Comparing the result in Theorem 16 to Theorem 15, we see that the condition on $u$ in (33) is almost identical to the condition on $m$ in (30), aside from a change in the logarithmic factor, so that the total number of data points needed is roughly the same. However, the number of labels indicated by (34) may often be significantly smaller than the condition in (30), reducing it by a factor of roughly $\theta a\varepsilon^{\alpha}$. This reduction is particularly strong when $\theta$ is bounded by a finite constant. Moreover, this is the same type of improvement that is known to occur when $\ell$ is itself the 0-1 loss [19], so that in particular these results agree with the existing analysis in this special case, and are therefore sometimes nearly minimax [19, 32]. Regarding the slight difference between (33) and (30) from replacing $\tau_\ell$ by $\chi_\ell$, the effect is somewhat mixed, and which of these is smaller may depend on the particular class $\mathcal{F}$ and loss $\ell$; we can generally bound $\chi_\ell$ as a function of $\theta(a\varepsilon^{\alpha})$, $\psi_\ell$, $a$, $\alpha$, $b$, and $\beta$. In the special case of $\ell$ equal to the 0-1 loss, both $\tau_\ell$ and $\chi_\ell$ are equal to $\theta(a(\varepsilon/2)^{\alpha})$.
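As a rough numerical illustration of this comparison, the following sketch evaluates the right-hand sides of (33) and (34) with the universal constant set to $c = 1$. It assumes $\mathrm{Log}(x) = \max\{\ln x, 1\}$ and treats $\theta$, $\chi_\ell$, $\mathrm{vc}(\mathcal{G}_{\mathcal{F}})$, and $\Psi_\ell(\varepsilon)$ as hypothetical inputs; the function names and all numerical values are illustrative assumptions rather than quantities taken from the analysis.

```python
import math

def Log(x):
    # Truncated logarithm; assumed here to mean max{ln x, 1}.
    return max(math.log(x), 1.0)

def cap(exponent, log_term):
    # min{1/(1 - 2^exponent), log_term}, with 1/0 = inf and min{inf, x} = x.
    denom = 1.0 - 2.0 ** exponent
    return min(math.inf if denom <= 0.0 else 1.0 / denom, log_term)

def theorem16_bounds(a, alpha, b, beta, ell_bar, psi, eps, theta, chi, vc, delta):
    """Right-hand sides of (33) and (34) with c = 1 (illustrative only)."""
    A1 = vc * Log(chi * ell_bar) + Log(1.0 / delta)
    B1 = cap(alpha + beta - 2.0, Log(ell_bar / psi))
    C1 = cap(alpha - 1.0, Log(ell_bar / psi))
    u = (b / psi ** (2.0 - beta) + ell_bar / psi) * A1
    n = theta * a * eps ** alpha * (
        b * (A1 + Log(B1)) * B1 / psi ** (2.0 - beta)
        + ell_bar * (A1 + Log(C1)) * C1 / psi
    )
    return u, n

# Hypothetical parameter values, chosen only to exercise the formulas.
u, n = theorem16_bounds(a=1.0, alpha=0.5, b=1.0, beta=0.5, ell_bar=1.0,
                        psi=0.005, eps=0.01, theta=2.0, chi=2.0, vc=10, delta=0.05)
print(f"unlabeled: {u:.0f}, labels: {n:.0f}, ratio: {n / u:.3f}")
```

With a small, bounded $\theta$ the label bound comes out below the unlabeled-sample bound, in line with the factor of roughly $\theta a\varepsilon^{\alpha}$ discussed above; for large $\theta$ the advantage can disappear.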

We note that the values $\hat{s}(\gamma, m)$ used in the proof of Theorem 16 have a direct dependence on the parameters $b$, $\beta$, $a$, and $\alpha$ from Condition 11 and Condition 10. Such a dependence may be undesirable for many applications, where information about these values is not available. However, one can easily follow this same proof, taking $\hat{s}(2^{-j}, m) = \mathrm{Log}\left(\frac{12\log_2(4\bar{\ell}2^{j})^{2}\log_2(2m)^{2}}{\delta}\right)$ instead, which only leads to an increase by a log log factor: specifically, replacing the factor of $A_1$ in (33), and the factors $(A_1 + \mathrm{Log}(B_1))$ and $(A_1 + \mathrm{Log}(C_1))$ in (34), with a factor of $(A_1 + \mathrm{Log}(\mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))))$. It is not clear whether it is always possible to achieve the slightly tighter result of Theorem 16 without having direct access to the values $b$, $\beta$, $a$, and $\alpha$ in the algorithm.

In the special case when $\ell$ satisfies Condition 3, we can derive a sometimes-stronger result via Corollary 9. Specifically, we can combine (27), (8), (9), and Lemma 12, to get that if $f^{\star} \in \mathcal{F}$ and Condition 3 is satisfied, then for $j \ge -\lceil\log_2(\bar{\ell})\rceil$ in Corollary 9,

(35) ^{^yw-y^^ 8 

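$\lesssim \left(b\left(2^{j}\mathcal{P}(\mathcal{U}_j)\right)^{2-\beta} + 2^{j}\bar{\ell}\,\mathcal{P}(\mathcal{U}_j)\right)\left(\mathrm{vc}(\mathcal{G}_{\mathcal{F}})\mathrm{Log}\left(\bar{\ell}\left(2^{j}\mathcal{P}(\mathcal{U}_j)\right)^{\beta}/b\right) + s\right)$,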

where $b$ and $\beta$ are as in Lemma 12. Plugging this into Corollary 9, with $\hat{s}$ defined analogously to that used in the proof of Theorem 16, and bounding the summations in the conditions for $u$ and $n$ in Corollary 9, we arrive at the following theorem. The details of the proof proceed along similar lines as the proof of Theorem 16, and a sketch of the remaining technical details is included in Appendix A.

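THEOREM 17. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10, $\ell$ is classification-calibrated and satisfies Condition 3, $f^{\star} \in \mathcal{F}$, $\Psi_\ell$ is as in (16), and $b$ and $\beta$ are as in Lemma 12, then for any $\varepsilon \in (0,1)$, letting $\theta = \theta(a\varepsilon^{\alpha})$, $A_2 = \mathrm{vc}(\mathcal{G}_{\mathcal{F}})\mathrm{Log}\left((\bar{\ell}/b)\left(a\theta\varepsilon^{\alpha}/\Psi_\ell(\varepsilon)\right)^{\beta}\right) + \mathrm{Log}(1/\delta)$, $B_2 = \min\left\{\frac{1}{1-2^{(\alpha-1)(2-\beta)}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, and $C_2 = \min\left\{\frac{1}{1-2^{(\alpha-1)}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, if

(36) $u \ge c\left(\frac{b(a\theta\varepsilon^{\alpha})^{1-\beta}}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)}\right)A_2$

and

(37) $n \ge c\left(b(A_2 + \mathrm{Log}(B_2))B_2\left(\frac{a\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)^{2-\beta} + \bar{\ell}(A_2 + \mathrm{Log}(C_2))C_2\left(\frac{a\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)\right)$

then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function satisfying (12), Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^{\star}) \le \varepsilon$.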

o 






Examining the asymptotic dependence on $\varepsilon$ in the above result, the sufficient number of unlabeled samples is $O\left(\frac{(\theta\varepsilon^{\alpha})^{1-\beta}}{\Psi_\ell(\varepsilon)^{2-\beta}}\mathrm{Log}\left(\left(\frac{\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)^{\beta}\right)\right)$, and the number of label requests is $O\left(\left(\frac{\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)^{2-\beta}\mathrm{Log}\left(\left(\frac{\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)^{\beta}\right)\right)$ in the case that $\alpha < 1$, or $O\left(\theta^{2-\beta}\mathrm{Log}(1/\varepsilon)\mathrm{Log}\left(\theta^{\beta}\mathrm{Log}(1/\varepsilon)\right)\right)$ in the case that $\alpha = 1$. This is noteworthy in the case $\alpha > 0$ and $r_\ell > 2$, for at least two reasons. First, the number of label requests indicated by this result can often be smaller than that indicated by Theorem 16, by a factor of roughly $O\left((\theta\varepsilon^{\alpha})^{1-\beta}\right)$; this is particularly interesting when $\theta$ is bounded by a finite constant. The second interesting feature of this result is that even the sufficient number of unlabeled samples, as indicated by (36), can often be smaller than the number of labeled samples sufficient for $\mathrm{ERM}_\ell$, as indicated by Theorem 15, again by a factor of roughly $O\left((\theta\varepsilon^{\alpha})^{1-\beta}\right)$. This indicates that, in the case of a surrogate loss $\ell$ satisfying Condition 3 with $r_\ell > 2$, when Theorem 15 is tight, even if we have complete access to a fully labeled data set, we may still prefer to use Algorithm 1 rather than $\mathrm{ERM}_\ell$; this is somewhat surprising, since (as (37) indicates) we expect Algorithm 1 to ignore the vast majority of the labels in this case. That said, it is not clear whether there exist natural classification-calibrated losses $\ell$ satisfying Condition 3 with $r_\ell > 2$ for which the indicated sufficient size of $m$ in Theorem 15 is ever competitive with the known results for methods that directly optimize the empirical 0-1 risk (i.e., Theorem 15 with $\ell$ the 0-1 loss); thus, the improvements in $u$ and $n$ reflected by Theorem 17 may simply indicate that Algorithm 1 is, to some extent, compensating for a choice of loss $\ell$ that would otherwise lead to suboptimal label complexities.

We note that, as in Theorem 16, the values $\hat{s}$ used to obtain this result have a direct dependence on certain values, which are typically not directly accessible in practice: in this case, $a$, $\alpha$, and $\theta$. However, as was the case for Theorem 16, we can obtain only slightly worse results by instead taking $\hat{s}(2^{-j}, m) = \mathrm{Log}\left(\frac{12\log_2(4\bar{\ell}2^{j})^{2}\log_2(2m)^{2}}{\delta}\right)$, which again only leads to an increase by a log log factor: replacing the factor of $A_2$ in (36), and the factors $(A_2 + \mathrm{Log}(B_2))$ and $(A_2 + \mathrm{Log}(C_2))$ in (37), with a factor of $(A_2 + \mathrm{Log}(\mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))))$. As before, it is not clear whether the slightly tighter result of Theorem 17 is always available, without requiring direct dependence on these quantities.

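5.5. Entropy Conditions. Next we turn to problems satisfying certain entropy conditions. In particular, the following represent two commonly-studied conditions, which allow for concise statement of results below.

CONDITION 18. For some $q \ge 1$, $\rho \in (0,1)$, and $F \ge \mathrm{F}(\mathcal{G}_{\mathcal{F},\mathcal{P}_{XY}})$, either $\forall\varepsilon > 0$,

(38) $\ln\mathcal{N}_{[]}\left(\varepsilon\|F\|_{\mathcal{P}_{XY}}, \mathcal{G}_{\mathcal{F}}, L_2(\mathcal{P}_{XY})\right) \le q\varepsilon^{-2\rho}$,

or for all finitely discrete $P$, $\forall\varepsilon > 0$,

(39) $\ln\mathcal{N}\left(\varepsilon\|F\|_{P}, \mathcal{G}_{\mathcal{F}}, L_2(P)\right) \le q\varepsilon^{-2\rho}$.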

< 

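In particular, note that when $\mathcal{F}$ satisfies Condition 18, for $0 \le \sigma \le 2\|F\|_{\mathcal{P}_{XY}}$,

(40) $\mathring{\phi}_\ell(\sigma, \mathcal{F}; \mathcal{P}_{XY}, m) \lesssim \max\left\{\frac{\sqrt{q}\,\|F\|^{\rho}_{\mathcal{P}_{XY}}\sigma^{1-\rho}}{(1-\rho)m^{1/2}}, \frac{\bar{\ell}^{\frac{1-\rho}{1+\rho}}q^{\frac{1}{1+\rho}}\|F\|^{\frac{2\rho}{1+\rho}}_{\mathcal{P}_{XY}}}{(1-\rho)^{\frac{2}{1+\rho}}m^{\frac{1}{1+\rho}}}\right\}$.

Since $\mathrm{D}_\ell([\mathcal{F}]) \le 2\|F\|_{\mathcal{P}_{XY}}$, this implies that for any numerical constant $c \in (0,1]$, for every $\gamma \in (0,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 11, then

(41) $\mathring{M}_\ell(c\gamma, \gamma; \mathcal{F}, \mathcal{P}_{XY}) \lesssim \frac{q\|F\|^{2\rho}_{\mathcal{P}_{XY}}}{(1-\rho)^{2}}\max\left\{b^{1-\rho}\gamma^{\beta(1-\rho)-2}, \bar{\ell}^{1-\rho}\gamma^{-(1+\rho)}\right\}$.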

Combined with (8), (9), (10), and Theorem 6, taking s(A,7) = Log (^j?f^ we 
arrive at the following classic result [e.g., 6, 35]. 

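THEOREM 19. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10 and Condition 11, $\mathcal{F}$ and $\mathcal{P}_{XY}$ satisfy Condition 18, $\ell$ is classification-calibrated, $f^{\star} \in \mathcal{F}$, and $\Psi_\ell$ is as in (16), then for any $\varepsilon \in (0,1)$ and $m$ with

$m \ge c\,\frac{q\|F\|^{2\rho}_{\mathcal{P}_{XY}}}{(1-\rho)^{2}}\left(\frac{b^{1-\rho}}{\Psi_\ell(\varepsilon)^{2-\beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho}}{\Psi_\ell(\varepsilon)^{1+\rho}}\right) + c\left(\frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)}\right)\mathrm{Log}\left(\frac{1}{\delta}\right)$,

with probability at least $1-\delta$, $\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ produces $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^{\star}) \le \varepsilon$.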

o 

Next, turning to the analysis of Algorithm 1 under these same conditions, com- 
bining (41) with (8), (9), and Theorem 7, we have the following result. The details 
of the proof follow analogously to the proof of Theorem 16, and are therefore omit- 
ted for brevity. 






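THEOREM 20. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10 and Condition 11, $\mathcal{F}$ and $\mathcal{P}_{XY}$ satisfy Condition 18, $\ell$ is classification-calibrated, $f^{\star} \in \mathcal{F}$, and $\Psi_\ell$ is as in (16), then for any $\varepsilon \in (0,1)$, letting $B_1$ and $C_1$ be as in Theorem 16, $B_3 = \min\left\{\frac{1}{1-2^{(\alpha+\beta(1-\rho)-2)}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, $C_3 = \min\left\{\frac{1}{1-2^{(\alpha-(1+\rho))}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, and $\theta = \theta(a\varepsilon^{\alpha})$, if

(42) $u \ge c\,\frac{q\|F\|^{2\rho}_{\mathcal{P}_{XY}}}{(1-\rho)^{2}}\left(\frac{b^{1-\rho}}{\Psi_\ell(\varepsilon)^{2-\beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho}}{\Psi_\ell(\varepsilon)^{1+\rho}}\right) + c\left(\frac{b}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}}{\Psi_\ell(\varepsilon)}\right)\mathrm{Log}\left(\frac{1}{\delta}\right)$

and

(43) $n \ge c\,\theta a\varepsilon^{\alpha}\,\frac{q\|F\|^{2\rho}_{\mathcal{P}_{XY}}}{(1-\rho)^{2}}\left(\frac{b^{1-\rho}B_3}{\Psi_\ell(\varepsilon)^{2-\beta(1-\rho)}} + \frac{\bar{\ell}^{1-\rho}C_3}{\Psi_\ell(\varepsilon)^{1+\rho}}\right) + c\,\theta a\varepsilon^{\alpha}\left(\frac{bB_1\mathrm{Log}(B_1/\delta)}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}C_1\mathrm{Log}(C_1/\delta)}{\Psi_\ell(\varepsilon)}\right)$

then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function satisfying (12), Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^{\star}) \le \varepsilon$.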

o 



The sufficient size of $u$ in Theorem 20 is essentially identical (up to the constant factors) to the number of labels sufficient for $\mathrm{ERM}_\ell$ to achieve the same, as indicated by Theorem 19. In particular, the dependence on $\varepsilon$ in these results is $O\left(\Psi_\ell(\varepsilon)^{\beta(1-\rho)-2}\right)$. On the other hand, when $\theta(\varepsilon^{\alpha}) = o(\varepsilon^{-\alpha})$, the sufficient size of $n$ in Theorem 20 does reflect an improvement in the number of labels indicated by Theorem 19, by a factor with dependence on $\varepsilon$ of $O(\theta\varepsilon^{\alpha})$.

As before, in the special case when $\ell$ satisfies Condition 3, we can derive sometimes stronger results via Corollary 9. In this case, we will distinguish between the cases of (39) and (38), as we find a slightly stronger result for the former.

First, suppose (39) is satisfied for all finitely discrete $P$ and all $\varepsilon > 0$, with $F \le \bar{\ell}$. Then following the derivation of (41) above, combined with (9), (8), and Lemma 12, for values of $j \ge -\lceil\log_2(\bar{\ell})\rceil$ in Corollary 9,

2 -J-8 ^L-j 



+ (b (VViUj)) 2 ^ + IVViUj)) s, 



where q and p are from Lemma 12. This immediately leads to the following result 
by reasoning analogous to the proof of Theorem 17. 

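THEOREM 21. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10, $\ell$ is classification-calibrated and satisfies Condition 3, $f^{\star} \in \mathcal{F}$, $\Psi_\ell$ is as in (16), $b$ and $\beta$ are as in Lemma 12, and (39) is satisfied for all finitely discrete $P$ and all $\varepsilon > 0$, with $F \le \bar{\ell}$, then for any $\varepsilon \in (0,1)$, letting $B_2$ and $C_2$ be as in Theorem 17, $B_4 = \min\left\{\frac{1}{1-2^{(\alpha-1)(2-\beta(1-\rho))}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, $C_4 = \min\left\{\frac{1}{1-2^{(\alpha-1)(1+\rho)}}, \mathrm{Log}(\bar{\ell}/\Psi_\ell(\varepsilon))\right\}$, and $\theta = \theta(a\varepsilon^{\alpha})$, if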

\ / / fe 1 ^ \ / a0e a y-W-P) ( D-p \ ( a6e a \ p " 



(l-p)Vl \*e{e)J \*i(e)J W^A*^) 



(l-p)V V V*^ 



a0e a \ 2-/3 „ , , „ , rt , / a#e a 



+ c £ 2 Log(5 2 /5)& — — + C 2 U>g{C 2 /5)t 



then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function satisfying (12), Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^{\star}) \le \varepsilon$.

o 

Compared to Theorem 20, in terms of the asymptotic dependence on $\varepsilon$, the sufficient sizes for both $u$ and $n$ here may be smaller by a factor of $O\left((\theta\varepsilon^{\alpha})^{1-\beta(1-\rho)}\right)$, which sometimes represents a significant refinement, particularly when $\theta$ is much smaller than $\varepsilon^{-\alpha}$. In particular, as was the case in Theorem 17, when $\theta(\varepsilon) = o(1/\varepsilon)$, the size of $u$ indicated by Theorem 21 is smaller than the known results for $\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ from Theorem 19.

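The case where (38) is satisfied can be treated similarly, though the result we obtain here is slightly weaker. Specifically, for simplicity suppose (38) is satisfied with $F = \bar{\ell}$ constant. In this case, we have $\bar{\ell} \ge \mathrm{F}(\mathcal{G}_{\mathcal{F}_j,\mathcal{P}_{\mathcal{U}_j}})$ as well, while $\mathcal{N}_{[]}\left(\varepsilon\bar{\ell}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{\mathcal{U}_j})\right) = \mathcal{N}_{[]}\left(\varepsilon\bar{\ell}\sqrt{\mathcal{P}(\mathcal{U}_j)}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{XY})\right)$, which is no larger than $\mathcal{N}_{[]}\left(\varepsilon\bar{\ell}\sqrt{\mathcal{P}(\mathcal{U}_j)}, \mathcal{G}_{\mathcal{F}}, L_2(\mathcal{P}_{XY})\right)$, so that $\mathcal{F}_j$ and $\mathcal{P}_{\mathcal{U}_j}$ also satisfy (38) with $F = \bar{\ell}$; specifically,

$\ln\mathcal{N}_{[]}\left(\varepsilon\bar{\ell}, \mathcal{G}_{\mathcal{F}_j}, L_2(\mathcal{P}_{\mathcal{U}_j})\right) \le q\,\mathcal{P}(\mathcal{U}_j)^{-\rho}\varepsilon^{-2\rho}$.

Thus, based on (41), (8), (9), and Lemma 12, we have that if $f^{\star} \in \mathcal{F}$ and Condition 3 is satisfied, then for $j \ge -\lceil\log_2(\bar{\ell})\rceil$ in Corollary 9,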

2 -j-8 2 l ~i 

+ (b (vviUj)) 2 ^ + ivviUj)) s 



where $b$ and $\beta$ are as in Lemma 12. Combining this with Corollary 9 and reasoning analogously to the proof of Theorem 17, we have the following result.

THEOREM 22. For a universal constant $c \in [1,\infty)$, if $\mathcal{P}_{XY}$ satisfies Condition 10, $\ell$ is classification-calibrated and satisfies Condition 3, $f^{\star} \in \mathcal{F}$, $\Psi_\ell$ is as in (16), $b$ and $\beta$ are as in Lemma 12, and (38) is satisfied with $F = \bar{\ell}$ constant, then for any $\varepsilon \in (0,1)$, letting $B_2$ and $C_2$ be as in Theorem 17, $B_5$ =



,wol^ 1 _„»_„„ ,Log (^iy)}> Cs = min j 1 _ 2a L -i_ p , Log ( — 



u > c 







and 



( ^ +£C 2 Log(C 2 /S) ' 



then, with arguments $\ell$, $u$, and $n$, and an appropriate $\hat{s}$ function satisfying (12), Algorithm 1 uses at most $u$ unlabeled samples and makes at most $n$ label requests, and with probability at least $1-\delta$, returns a function $\hat{h}$ with $\mathrm{er}(\hat{h}) - \mathrm{er}(f^{\star}) \le \varepsilon$.

o 

In this case, compared to Theorem 20, in terms of the asymptotic dependence on $\varepsilon$, the sufficient sizes for both $u$ and $n$ here may be smaller by a factor of $O\left((\theta\varepsilon^{\alpha})^{(1-\beta)(1-\rho)}\right)$, which may sometimes be significant, though not quite as dramatic a refinement as we found under (39) in Theorem 21. As with Theorem 21, when $\theta(\varepsilon) = o(1/\varepsilon)$, the size of $u$ indicated by Theorem 22 is smaller than the known results for $\mathrm{ERM}_\ell(\mathcal{F}, Z_m)$ from Theorem 19.

5.6. Remarks on VC Major and VC Hull Classes. Another widely-studied family of function classes includes VC Major classes. Specifically, we say $\mathcal{G}$ is a VC Major class with index $d$ if $d = \mathrm{vc}(\{\{z : g(z) > t\} : g \in \mathcal{G}, t \in \mathbb{R}\}) < \infty$. We can derive results for VC Major classes, analogously to the above, as follows. For brevity, we leave many of the details as an exercise for the reader. For any VC Major class $\mathcal{G} \subseteq \mathcal{G}^{*}$ with index $d$, by reasoning similar to that of Giné and Koltchinskii [15], one can show that if $F = \bar{\ell}\mathbb{1}_{\mathcal{U}} \ge \mathrm{F}(\mathcal{G})$ for some measurable $\mathcal{U} \subseteq \mathcal{X} \times \mathcal{Y}$, then for any distribution $P$ and $\varepsilon > 0$,

InAA (e\\F\\ P ,g, L 2 {P)) < ~ log (J\ log (I 

This implies that for F a VC Major class, and £ classification-calibrated and ei- 
ther nonincreasing or Lipschitz, if /* € T and Vxy satisfies Condition 10 and 
Condition 1 1 , then the conditions of Theorem 7 can be satisfied with the proba- 
bility bound being at least 1 — 5, for some u = O ( ^(^2-3/2 + ^i( e )^~ 2 ^j an d 

n = 6 ( f ^2-^/2 + 6e a Vi(e)P- 2 ), where 6 = 6(ae a ), and O(-) hides logarith- 



ms) 2 -^ 2 

mic and constant factors. Under Condition 3, with (3 as in Lemma 12, the conditions 
of Corollary 9 can be satisfied with the probability bound being at least 1 — 5, for 

some „ = o I ( ^ ) ( and n = 6 ^ *" 






For example, for $\mathcal{X} = [0,1]$ and $\mathcal{F}$ the class of all nondecreasing functions mapping $\mathcal{X}$ to $[-1,1]$, $\mathcal{F}$ is a VC Major class with index 1, and $\theta(0) \le 2$ for all distributions $\mathcal{P}$. Thus, for instance, if $\eta$ is nondecreasing and $\ell$ is the quadratic loss, then $f^{\star} \in \mathcal{F}$, and Algorithm 1 achieves excess error rate $\varepsilon$ with high probability for some $u = \tilde{O}\left(\varepsilon^{2\alpha-3}\right)$ and $n = \tilde{O}\left(\varepsilon^{3(\alpha-1)}\right)$.

VC Major classes are contained in special types of VC Hull classes, which are more generally defined as follows. Let $\mathcal{C}$ be a VC Subgraph class of functions on $\mathcal{X}$, with bounded envelope, and for $B \in (0,\infty)$, let $\mathcal{F} = B\,\mathrm{conv}(\mathcal{C}) = \left\{x \mapsto B\sum_{j}\lambda_j h_j(x) : \sum_{j}|\lambda_j| \le 1, h_j \in \mathcal{C}\right\}$ denote the scaled symmetric convex hull of $\mathcal{C}$; then $\mathcal{F}$ is called a VC Hull class. For instance, these spaces are often used in conjunction with the popular AdaBoost learning algorithm. One can
derive results for VC Hull classes following analogously to the above. Specifi- 
cally, for a VC Hull class F = -Bconv(C) with d = vc(C), if I is classification- 
calibrated and Lipschitz, /* G F, and Vxy satisfies Condition 10 and Con- 
dition 1 1 , then the conditions of Theorem 7 can be satisfied with the probabil- 

~ / j 2)3 n \ 

ity bound being at least 1 — 5, for some u = O (6»e Q )^ and 

~ / 2d+2 2/3 „\ 

n = O ( (9e a ) d + 2 ^t(e) d + 2 J . Under Condition 3, with /3 as in Lemma 12, the 
conditions of Corollary 9 can be satisfied with the probability bound being at least 

1 - ft for some „ = 6 ((^) (^j) 1 "*) and „ = <5 ((^j) 2 "*). 

However, it is not clear whether these results for VC Hull classes have any practical implications, since we do not know of any examples of VC Hull classes where these results reflect an improvement over a more direct analysis of $\mathrm{ERM}_\ell$ for these scenarios.

APPENDIX A: PROOFS 

Proof of Theorem 7. The proof has two main components: first, showing that, with high probability, $f^{\star} \in V$ is maintained as an invariant, and second, showing that, with high probability, the set $V$ will be sufficiently reduced to provide the guarantee on $\hat{h}$ after at most the stated number of label requests, given the value of $u$ is as large as stated. Both of these components are served by the following application of Lemma 4.

Let $K$ denote the set of values of $k \in \mathbb{N}$ obtained in Algorithm 1. Let $S$ denote the set of pairs $(k', m')$ such that $k' \in K$ and Algorithm 1 reaches the value $m = m'$ in Step 2 while $k = k'$. For each $k \in K$, let $V^{(k)}$ denote the value of $V$ upon obtaining that value of $k$ in Algorithm 1 (either in Step 0 or Step 8), and let $D_k = \mathrm{DIS}(V^{(k)})$. For each $(k,m) \in S$, let $Q_m$ denote the value of $Q$ in Step 5 on the round that Algorithm 1 obtains that value of $m$.

Consider any (k, m) G S. Let C m = {(m k + 1, Y mk+1 ), ...,(m, Y m )}. Note 






thatVM G V (k \ 



(44) (\Q m \\/l)(Mh;Q m )-Re(g;Q m )) 

= (m - m k ) (R^(/i Dfe ;£ m ) - R e (g Dk ;C m )) , 

and furthermore that 

(45) (\Q m \Vl)U e (V^;Q m ,s(% , to - m k )) 

= (m - m k )U t (V^; C m ,Klk^ m ~ m fe )). 

Applying Lemma 4 under the conditional distribution given k, V^ k \ m k , and 
%, we have that for any to > to/j, on an event of (conditional) probability at 
least 1 - Q e ~^ k ' m - mk \ if /* G and (k,m) G 5, then letting u fcjm = 



t/i ( ; £ m , s to - m fc )J , every h Dk G has 
(46) R e (h Dk ) - R t (f) < R e (h Dk ;C m ) - R e (f*;£ m ) + u k>m , 



r(fc) 



(47) R^(/i Dfc ;£ m ) - min R t (g Dk ;C m ) < Re(h Dk ) - Rl(f*) + u k , m , 



and furthermore 



(48) n fcjm < Ui \ Vj^;VxY,m- m k ,s(j k ,m- m k )^j . 

Letj'k = U°S2(l/7fe)J for values of A; G K. Then (12) implies 5(7^, to — m k ) = 
Sj k (m — m k ). By a union bound and the law of total probability, on an event of 
probability at least 



1 - E 



E E ^ {2i) 

k€K:%>r e (e)/2 i=l 



for every (k,m) G 5 with % > T^{e)/2, to < + n Jfe , log 2 (TO — m^) G 
N, and /* G V^ k \ the inequalities (46), (47), and (48) hold. Call this event E. 
Note that % > Y^{e)/2 implies j k < Llog 2 (2/r^(e))J . Furthermore, since each 
k G K with k > 1 has % < / y k _i/2, and 71 = £, we have j^+i > jfe + 1 

and jk > k - flog 2 (2/)l. This implies E* e i^>rv £ )/2 ee"^ 2 ') < 



E 



Llog 2 (2/r £ (e))J ^log a («j) 
j=-riog 2 Wl 



E2l 6e s j ( 2 '), so that event E has probability at least 

Lio g2 (2/r f ( £ ))j io g2 ( % ) 

E E 






For the remainder of this proof, we will suppose the event E occurs. 

Define jo = — oo and tjiq = Uj = 0. We proceed by induction, establishing the 
following claims for all k G K U {0} having j k < [\og 2 (2/Te(e))\ . 

Claim 1: max{m GN: (k, m) G 5}U{mjt} < + uj k . If equality is obtained, 

then we also have k + 1 G K. 
Claim 2: If k + 1 € then Vh G R^ fc+1 ) - R<(/*) < 2 7fe+1 . 

Claim 3: If k + 1 G K, then /* G V^ k+1 \ 

We can think of k = as a base case for this inductive proof, since then the first 
claim is trivially satisfied, while the second claim is satisfied due to K^(hDi) — 
I < 271, and the third claim is satisfied by assumption (since = F). Now 
suppose these three claims hold for k equal k' — 1, for some k' G N with k' G K 
andj fc , < Llog 2 (2/r / (e))J. 

If it happens that (£;', rah* +Uj k , ) G S, then by definition of u 3y and monotonic- 
ity of in 1 — y Ue(-, •; ■, m, •), we have 

tit{T 3y ^-V XY ,u hnhy (u jkl )) <2-^" 2 . 

Plugging in the definition of jy, by (12) and (7), this implies 

(49) Ui(j c jk ,,2%r,VxY,u jkn s(%',Uj k ,)) <%'/2. 

(k 1 ) 

Furthermore, since /* G F, Claim 2 and the definition of £^(-) imply ' C 
[J 7 ] (£^ (2%') ; 01). Since Claim 3 and the definition of imply sign(/i£> fe , ) = 
sign(/i) for all h G V^'), we have er(/i) = er(/i£> fe , ) for all /i G V( fc '\ so that 
y( fc/ ) C [J 7 ] (£^ (2-7^/) ; 01); we also have V k ' C J 7 , so that together these imply 

(50) V C J" (£, (2%, ) ; 01) C T (E t (2 1 ^' ) ; 01) . 

This also implies D k > C DIS (J 7 (£^ (2 1 -Jfc') ; i)). Combined with (49) and (7), 
these imply 

^ (VdJ ^ 2A fk';V X Y,u jk ,,5 (%',u jh ,y\ < %>/2. 
Together with (6), this implies 

U t (yg'J (2%, ; t) ■ Vxy , u jk , , S (%, , u 3y ) ) < %,/2. 

(k') (k') 

Claim 2 implies Vq / = Vf) ' [2^;£), which means 




Since log 2 (it jV ) G N, jy < Llog 2 (2/T £ (e))J , and Claim 3 implies /* G V^ k '\ 
combining the above with (48) implies that on the event E, 

By (45), this also means 

Ue (v^;Q mkl+Ujk/ ,s (jy,u jk ,)) " < %,/2. 

The left hand side of this inequality is precisely the value 

A \Qm k ,+u jk , \ V 1 



Til 

V J Qm k i+Uj k , ? Wife' + 1ij fe , , J 



so that the condition in Step 5 of Algorithm 1 will be satisfied if and when k = k' 
and m = my + Uj ,. In summary, we have shown that if (k',my + Uj ,) G S, 
then maxjm G N : (k',m) G 5} U {my} = my + Uj , and k' + 1 G if. 
Furthermore, since {m G N : (A/, m) G S} U {m^/} is a sequence of consecutive 
integers including my, if (k',my + ,) ^ 5, then max{?n G N : (k',m) G 
S} U {rrifc/} < mfc' + Uj ,. In either case, we have established Claim 1 for k equal 
to k'. 

Next we consider Claim 2 and Claim 3. If k' + 1 ^ if , then Claim 2 and Claim 
3 are trivially satisfied for k equal to £/. Otherwise, suppose k' + 1 G if. Let 
m' = maxjm G N : (k',m) G 5}U{m,fc/}. By Claim 1, we have m' < my+Uj ,. 
Furthermore, fc' + 1 G if implies that the condition in Step 5 in Algorithm 1 is 
satisfied for k equal k! and m equal m' , so that log 2 (W — my) G N and 



Rl(h;Q m >)- mm R e (g;Q m/ ) <U e (v {k,) ;Q m >,s(%>,m' -my)) >. 

By (44) and the definition of this is equivalently expressed as 

(51) 

= \ he : R £ (^ fc ,;£ m - min R e (g Dk ,; £ m ,) < \ . 

By Claim 3, /* G thus, (46) implies that on the event E, every h G W fc ' +1 ) 

has 

R^(^D fe ,) - Ri(f*) < 7k'+i + Ue (vj£J;£ m ',s (%>,m' - my 






By (45), this is equivalently expressed as 

MhD k ,)-Mn <27fc'+i- 

Since R^(/i£) fe , +i ) < Ke(hD k ,), we have established Claim 2 for equal to k! . 

Furthermore, Claim 3 implies /* G y( fe '), so that by (47), on the event E, we 
have 

R<(/*;An') - RtfCsiV^m') < ^ (v$^;£ m /,i (%>,m' - m k >)) . 

By (45) and the definition of 7fc'+i, the right hand side of this inequality is equal to 
jk'+i- In particular, combined with (51), this implies /* G V^ k +l \ which estab- 
lishes Claim 3 for k equal to k'. 

Finally, note that jk is nondecreasing, so that the values of k G K U {0} with 
ifc < Uof-teO^AM^))] form a sequence of consecutive integers starting with 0. 
Thus, by the principle of induction, these three claims hold (on event E) for all 
k G K for which jf. < |_log 2 (2/r*(£-))J. 

Similar to above, by Claim 2, for all k G K with j k -i < Llog 2 (2/T£(e))J, 

(52) vg ) C.T(e < (27fc);oi). 

Since every /i G has sign(/i(x)) = sign(/*(x)) = sign(/i£> fe (x)) for all 
x ^ Dfc, we have that V/i G er(/i) = er(/io fe ). Thus, since W fc ) C T and 

/* G F, (52) implies 

(53) ^ C J-(£ £ (2 7fc );oi). 

In particular, letting A;* = max{/c G K : jk < Llog 2 (2/r^(e))J }, if k* + 
1 G K, then jk*+i > [log 2 (2 /T i(e)) \ , so that < Te(e)/2, which means 

8, e {2%* +1 ) < e. Together with (53), this implies C T*(e;oi). Since the 

update in Step 7 always keeps at least one element in V, the function h in Step 
9 exists, and has h G y( max ^) = f| fce/f C C J"*(e;oi), so that 

er (ji^j — er (/*) < e, as claimed. 

All that remains is to bound the sizes of u and n sufficient to guarantee k* + 1 G 
Jv". By Claim 1, A;* + 1 G if would be guaranteed as long as 



k* 

(54) u>Y,u jk 
k=l 



m k *+u jk * fe* — 1 m fe+1 

and n> ^ l Dl ,(I m )+^ ^ l Dt (l m )- 

m=m fc *+l k=l m=m k +l 






Since every k < k* has — [log 2 (^)] < jk < \\og 2 {1/Yi{e))\ , and (as noted above) 
3k > Jk-i + 1, we have that 

k * Uog 2 (2/r,( £ ))j 
(55) ^ u h < u r 

k=l i=-flog 2 (I)l 
Furthermore, for all k < k*, (53) and monotonicity imply that 

D k C DIS (J 7 (£* (2 1 ^' fc ) ; oi)) = DIS {JF jk ) = U jk , 

so that 

™fc*+«j fc * fe*-l m fc+ i 

m=m k *+l fc=l m=m; c +l 

m k *+u jk , k *_i mk+1 

< E k.^+E E k 

m=m fe *+l fc=l m=rrife+l 

Since is nonincreasing in j, we have that ty. (X m ) is nonincreasing in j for all 
m. Combining this with Claim 1 and the above properties of j k , we have 

E k. E 

m=m fc *+l k=l m=m k +l 

Lio g2 (2/rf( e ))j E- = _ riog2( | )1 % 

< E E ^(^m)- 

i=-rio S2 wi m=i+E^ riog2( ,- )1 « i 

In summary, we have 



m k*+ u o k * k*-l m k+1 

(56) Yl 1ivP^)+E E W*m) 

m=m k *+l k=l m=m k +l 

Llog 2 (2/rf(e))J Si=-riog 2 (F)l 
< E E V*™). 

Note that the indicators 1^. (X m ) in the summation on the right hand side of (56) 
are independent, so that a Chernoff bound implies that on an event E' of probability 






at least 1 - 2" s , 
(57) 

Llog 2 (2/r,(e))J £ 

E E 



p°g 2 (f)l u i 



Llog 2 (2/r f (e))J 

l^(X m )<s + 2e P(Uj)uj. 

j=-\log 7 (l)] 

-rio g2 (Qi 

Combining (54), (55), (56), and (57) implies that, for u and n as in the statement 
of Theorem 7, on the event E n we have A;* + 1 £ K. A union bound implies 
that the event E C\ E' has probability at least 

Llog 2 (2/r f ( £ ))J Iog 2 (u,) 

l-2" s - £ £ 6e-^\ 

i=-rio g2 wi i=i 

as required. □ 
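The Chernoff step used for (57) can be sanity-checked numerically. The sketch below is a simplification under stated assumptions: it uses identically distributed indicators (a binomial variable), whereas the sum in (57) mixes indicators with different success probabilities, and the function names are hypothetical. It computes the exact upper tail and compares it with $2^{-s}$ at the threshold $s + 2e\mu$, where $\mu$ is the mean.

```python
import math

def binomial_upper_tail(n, p, t):
    """Exact P(X >= t) for X ~ Binomial(n, p)."""
    k0 = max(0, math.ceil(t))
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k0, n + 1))

def chernoff_holds(n, p, s):
    """Check P(X >= s + 2e * E[X]) <= 2**(-s) for X ~ Binomial(n, p)."""
    mean = n * p
    threshold = s + 2 * math.e * mean
    return binomial_upper_tail(n, p, threshold) <= 2.0 ** (-s)

# Hypothetical settings (n indicators, success probability p, slack s).
for n, p, s in [(200, 0.05, 3), (100, 0.01, 1), (30, 0.5, 2)]:
    print(n, p, s, chernoff_holds(n, p, s))
```

The inequality follows from the moment generating function bound $\mathbb{P}(X \ge t) \le e^{\mu}2^{-t}$, since $e^{\mu}2^{-2e\mu} \le 1$.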

Proof of Lemma 8. If P (DISF(T^)) = 0, then <fo(%; m, P) = 0, so that in 
this case, ^ trivially satisfies (5). Otherwise, suppose P (DISF(T^)) > 0. By the 
classic symmetrization inequality [e.g., 35, Lemma 2.3.1], 

4>t{U,m,P) <2E\ fa(H;Q,Z lm 



where Q ~ P m and Hr m 
dent. Fix any measurable U D DISF(%). Then 



(58) 



E 



<^(%;<3,Er m ]) 



{Cij • • • > Cm} ~ Uniform({— 1, +l} m ) are indepen- 

\QC\U\ 



E 



^(7^;QnW,S[|Qnw|]) 



??? 



where = . . . , £,y} for any q 6 {0, . . . , m}. By the classic desymmetriza- 
tion inequality [see e.g., 24], applied under the conditional distribution given \Q n 
W|, the right hand side of (58) is at most 
(59) 



E 



2<f> £ (H,\QnU\,P u ) 



\Qr\U\ 



rn 



E 



+ sup \R e {h; P u ) - Re(g; P u )\ 

h,gen 



^/\Q(TU\ 



m 



By Jensen's inequality, the second term in (59) is at most 



sup \R e (h;P u )-Re(g;Pu)\ 

h,geH 



P(U) 



<V e (H;P u ) 



m \ m 



Decomposing based on \Q n U\, the first term in (59) is at most 
(60) E 



2<j> t {H, \Q n U\,P U ) \ QnU h [\Q n U\ > {l/2)P{U)m\ 



m 



+ 2eP{U)F(\QnU\ < (l/2)P(W)m) 






Since \QnU\ > (l/2)P(W)m => \QnU\ > \(l/2)P(U)m~\, and <j> t (H, q,Pu) is 
nonincreasing in g, the first term in (60) is at most 



2WH,\(l/2)P(U)m\,P u )E 



\Qr\U\ 



2<t> £ (H,\(l/2)P(U)mlPu)P(U), 



m 

while a Chernoff bound implies the second term in (60) is at most 

W£ 

2£P(U)exp{-P(U)m/8} < . 

m 

Plugging back into (59), we have 
(61) 

MH,m,P) < 4<Mft, \(l/2)P(U)m],P u )P(U) + — + 2D t (H;P)J-. 

m V m 

Next, note that, for any a > D^(ft; P), , a > D £ (ft; P w ). Also, if U = W x y 

v Piu) 

for some W 5 DISF(ft), then /* w = /*, so that if /* £ ft, (5) implies 

(62) r(l/2)P(W)ml,P w ) < & \—Z=,H-, \{l/2)P(U)m\ , P u J . 

Combining (61) with (62), we see that ^ satisfies the condition (5) of Definition 5. 

Furthermore, by the fact that satisfies (4) of Definition 5, combined with the 
monotonicity imposed by the infimum in the definition of it is easy to check that 
Hp. also satisfies (4) of Definition 5. In particular, note that any ft" C'H'C [J 7 ] and 
U" C X have DiSF(ft^„) C DISF(ft'), so that ^ ran § e of W in the infimum is 
never smaller for ft = ft^„ relative to that for ft = ft'. □ 

Proof of Corollary 9. Let 4> t be as in Lemma 8, and define for any m e 

N, s e [1, oo), C e [0, oo], and ft C [J 7 ], 

U' e (n,C;T X Y,m,s) 

= ft fe(D,([ft](C;^)),ft;m,ftxF) +D,([ft](C;^)) A /^+ - 
\ V m m 

That is, ft^ is the function Ut that would result from using in place of (f>£. Let 
U = DISF(ft), and suppose V{U) > 0. Then since DISFQftj) = DISF(ft) 
implies 



D,([ft](C;£)) = D ( ([«]((;4P M )VP(Wj 



T> t {[U]{C/V{U)-,i,Vu),Vu)y/vW)i 



a little algebra reveals that for m > 2V(U)~ 1 , 

(63) U' e (H,C,V X Y,m,s) < 33V(U)U e (n,(/V(Uy,V u , \(l/2)V(U)m],s). 

In particular, for j > — |~log 2 (l)] , taking % = Tj, we have (from the definition of 
Tj) U = DISF(ft) = DlS(ft) = Uj, so that when V{Uj) > 0, any 



m > 



suffices to make the right side of (63) (with s = Sj(m) and ( = 2 1 "- 7 ) at most 
2~ J ~ 2 ; in particular, this means taking uj equal to any such m (with log 2 (m) £ N) 
suffices to satisfy (13) (with the in (13) denned with respect to the 4>' e function); 
monotonicity of £ h- > ((,, ; ; Vuj > Sj (m)\ implies (15) is a sufficient 

condition for this. In the special case where V(Uj) = 0, U'^Tj, 2 1 ^'; V X Y , m, s) 
= K^, so that taking Uj > K£sj(uj)2^ +2 suffices to satisfy (13) (again, with the 
in (13) defined in terms of Plugging these values into Theorem 7 completes 
the proof. □ 

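Proof of Theorem 16. For $-\lceil\log_2(\bar{\ell})\rceil \le j \le \lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor$, let $s_j = \mathrm{Log}\left(\frac{48\left(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j\right)^{2}}{\delta}\right)$, and define $u_j = 2^{\lceil\log_2(u'_j)\rceil}$, where

(64) $u'_j = c'\left(b2^{j(2-\beta)} + \bar{\ell}2^{j}\right)\left(\mathrm{vc}(\mathcal{G}_{\mathcal{F}})\mathrm{Log}(\chi_\ell\bar{\ell}) + s_j\right)$,

for an appropriate universal constant $c' \in [1,\infty)$. Note that, by (32), (8), and (9), we can choose the constant $c'$ so that these $u_j$ satisfy (13) when we define

$\hat{s}_j(m) = \mathrm{Log}\left(\frac{12\log_2(4u_j/m)^{2}\left(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j\right)^{2}}{\delta}\right)$.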

Additionally, let $s = \log_2(2/\delta)$.
Next, note that 

Llog 2 (2/r>(e))J Ll««a(2/*<(e))J 

j=-rio g2 ®i i=-riog 2 wi 



U 3 



* * + 4>) ( vc Log w> + Log (f )) 

(65) +40^ ^ ^ +^ £ 2aog(Llog 2 (8/^( £ ))j-i). 






We can bound this last summation by noting that 

Llog a (2/*i(e))J 

(66) Yl y'Log(Llog 2 (8/^(e))J-i) 

j=-\log 2 (l)] 

Llog 2 (2/^( E ))J 

^ ^) E 2^L^(^( £ ))J Lo g(Llog 2 (8/*,( £ ))j -j) 

oo oo 

Plugging this into (65), we have that y^ log2 r, 2//r ^f^ «i is at most 

8c ' + ( vc <fc) LoE w> + Log (^) ) • 

Thus, by choosing c > 160c', any u satisfying (33) has u > Sj-^nowlyn w j' as 
required by Theorem 7. 

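For $u_j$ as in Theorem 7, note that by Condition 10 and the definition of $\theta$,

$\mathcal{P}(\mathcal{U}_j) = \mathcal{P}\left(\mathrm{DIS}\left(\mathcal{F}\left(\mathcal{E}_\ell(2^{1-j}); 01\right)\right)\right) \le \mathcal{P}\left(\mathrm{DIS}\left(\mathrm{B}\left(f^{\star}, a\mathcal{E}_\ell(2^{1-j})^{\alpha}\right)\right)\right)$
$\le \theta\max\left\{a\mathcal{E}_\ell(2^{1-j})^{\alpha}, a\varepsilon^{\alpha}\right\} \le \theta\max\left\{a\Psi_\ell^{-1}(2^{1-j})^{\alpha}, a\varepsilon^{\alpha}\right\}$.

Because $\Psi_\ell$ is strictly increasing on $(0,1)$, for $j \le \lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor$, $\Psi_\ell^{-1}(2^{1-j}) \ge \varepsilon$, so that this last expression is equal to $\theta a\Psi_\ell^{-1}(2^{1-j})^{\alpha}$. This implies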

Llog 2 (2/r f ( £ ))J Llog 2 (2/*He))J 
j=-\log 2 (£)] i=-flog 2 (/)] 

(67) 

Llog 2 (2/^( £ ))J 

Y aMj 1 (2 l ^) a {b2^ + IV) (A, + Log( Llog 2 (8/*,( £ ))J - j)) . 

j=-rio g2 wi 

We can change the order of summation in the above expression by letting i = 

[log 2 (2/tt/(e))J - j and summing from to N = \log 2 (£)] + |_log 2 (2/tt*(e))J. 

In particular, since 2^2(2/^(2))] < 2/*/(e), (67) is at most 

(68) 

g ^ (*-»»W>™*y + (A, + Log(i + 2)) . 




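Since $x \mapsto \Psi_\ell^{-1}(x)/x$ is nonincreasing on $(0,\infty)$, $\Psi_\ell^{-1}\left(2^{1-\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}2^{i}\right) \le 2^{i+1}\Psi_\ell^{-1}\left(2^{-\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\right)$,
and since $\Psi_\ell^{-1}$ is increasing, this latter expression is
at most $2^{i+1}\Psi_\ell^{-1}(\Psi_\ell(\varepsilon)) = 2^{i+1}\varepsilon$. Thus, (68) is at most

(69)  $8a\theta\varepsilon^{\alpha}\sum_{i=0}^{N}\left(\frac{b2^{i(\alpha+\beta-2)}}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}2^{i(\alpha-1)}}{\Psi_\ell(\varepsilon)}\right)\left(A_1 + \mathrm{Log}(i+2)\right)$.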
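
To make the monotonicity step concrete, here is a minimal numerical sketch. It uses a hypothetical stand-in $\Psi_\ell(\varepsilon) = \varepsilon^2$ (not taken from the paper), whose inverse $\sqrt{x}$ has a nonincreasing ratio $\Psi_\ell^{-1}(x)/x$, and checks the chain $\Psi_\ell^{-1}(2^{1-J}2^{i}) \le 2^{i+1}\Psi_\ell^{-1}(2^{-J}) \le 2^{i+1}\varepsilon$ for every index $i$ in the sum, where $J$ abbreviates $\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor$.

```python
import math

def psi(eps):
    # Hypothetical stand-in for Psi_ell: strictly increasing on (0, 1).
    return eps ** 2

def psi_inv(x):
    # Inverse of the stand-in; psi_inv(x) / x = x ** -0.5 is nonincreasing.
    return math.sqrt(x)

def check(eps, ell_bar):
    # Returns True if both inequalities hold for every i in {0, ..., N}.
    J = math.floor(math.log2(2.0 / psi(eps)))
    N = math.ceil(math.log2(ell_bar)) + J
    for i in range(N + 1):
        lhs = psi_inv(2.0 ** (1 - J) * 2.0 ** i)
        mid = 2.0 ** (i + 1) * psi_inv(2.0 ** (-J))
        rhs = 2.0 ** (i + 1) * eps
        # Small relative tolerance guards against floating-point rounding.
        if not (lhs <= mid * (1 + 1e-12) and mid <= rhs * (1 + 1e-12)):
            return False
    return True

print(check(0.1, 2.0))
```

The second inequality uses only $2^{-J} \le \Psi_\ell(\varepsilon)$, which holds for any choice of $\Psi_\ell$ by the definition of the floor.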



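In general, $\mathrm{Log}(i+2) \le \mathrm{Log}(N+2)$, so that $\sum_{i=0}^{N} 2^{i(\alpha+\beta-2)}(A_1 + \mathrm{Log}(i+2)) \le (A_1 + \mathrm{Log}(N+2))(N+1)$
and $\sum_{i=0}^{N} 2^{i(\alpha-1)}(A_1 + \mathrm{Log}(i+2)) \le (A_1 + \mathrm{Log}(N+2))(N+1)$.
When $\alpha + \beta < 2$, we also have $\sum_{i=0}^{N} 2^{i(\alpha+\beta-2)} \le \sum_{i=0}^{\infty} 2^{i(\alpha+\beta-2)} = \frac{1}{1-2^{(\alpha+\beta-2)}}$
and $\sum_{i=0}^{N} 2^{i(\alpha+\beta-2)}\mathrm{Log}(i+2) \le \sum_{i=0}^{\infty} 2^{i(\alpha+\beta-2)}\mathrm{Log}(i+2) \le \frac{2}{1-2^{(\alpha+\beta-2)}}\mathrm{Log}\left(\frac{1}{1-2^{(\alpha+\beta-2)}}\right)$.
Similarly, if $\alpha < 1$, $\sum_{i=0}^{N} 2^{i(\alpha-1)} \le \sum_{i=0}^{\infty} 2^{i(\alpha-1)} = \frac{1}{1-2^{(\alpha-1)}}$,
and likewise $\sum_{i=0}^{N} 2^{i(\alpha-1)}\mathrm{Log}(i+2) \le \sum_{i=0}^{\infty} 2^{i(\alpha-1)}\mathrm{Log}(i+2) \le \frac{2}{1-2^{(\alpha-1)}}\mathrm{Log}\left(\frac{1}{1-2^{(\alpha-1)}}\right)$.
By combining these observations (along with a convention
that $\frac{1}{1-2^{(\alpha-1)}} = \infty$ when $\alpha = 1$, and $\frac{1}{1-2^{(\alpha+\beta-2)}} = \infty$ when $\alpha = \beta = 1$), we
find that (69) is

$\lesssim a\theta\varepsilon^{\alpha}\left(\frac{b(A_1 + \mathrm{Log}(B_1))B_1}{\Psi_\ell(\varepsilon)^{2-\beta}} + \frac{\bar{\ell}(A_1 + \mathrm{Log}(C_1))C_1}{\Psi_\ell(\varepsilon)}\right)$.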



Thus, for an appropriately large numerical constant $c$, any $n$ satisfying (34) has

$n \ge s + 2e\sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \mathcal{P}(U_j)u_j$,

as required by Theorem 7.
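
The two regimes used to bound the sums in (69) can be checked numerically. The sketch below is only an illustration under the assumption Log(x) = max{ln(x), 1}; the ratio `r` stands for $2^{(\alpha+\beta-2)}$ or $2^{(\alpha-1)}$, and `A` is a hypothetical value for $A_1$. It compares the weighted sum against the crude bound $(N+1)(A + \mathrm{Log}(N+2))$, valid for any $r \le 1$, and against the geometric bound $\frac{A}{1-r} + \frac{2}{1-r}\mathrm{Log}\left(\frac{1}{1-r}\right)$, valid when $r < 1$.

```python
import math

def Log(x):
    # Truncated logarithm; assumed here to be max{ln(x), 1}.
    return max(math.log(x), 1.0)

def weighted_sum(r, N, A):
    # sum_{i=0}^{N} r^i (A + Log(i + 2)).
    return sum(r ** i * (A + Log(i + 2)) for i in range(N + 1))

def crude_bound(N, A):
    # Every one of the N + 1 terms is at most A + Log(N + 2) when r <= 1.
    return (N + 1) * (A + Log(N + 2))

def geometric_bound(r, A):
    # For r < 1: sum r^i = 1/(1-r), and the Log-weighted series is at most
    # (2/(1-r)) Log(1/(1-r)).
    g = 1.0 / (1.0 - r)
    return g * A + 2.0 * g * Log(g)

# Hypothetical exponents, chosen only for illustration.
alpha, beta, A, N = 0.5, 0.5, 3.0, 12
for r in (2.0 ** (alpha + beta - 2.0), 2.0 ** (alpha - 1.0)):
    total = weighted_sum(r, N, A)
    assert total <= crude_bound(N, A)
    assert total <= geometric_bound(r, A)
    print(r, total, min(crude_bound(N, A), geometric_bound(r, A)))
```

Taking the smaller of the two bounds is what produces a minimum of a geometric-series factor and a logarithmic factor in the final expression; when $r = 1$ only the crude bound applies, matching the convention that the geometric factor is infinite.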

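Finally, we need to show the success probability from Theorem 7 is at least $1 - \delta$,
for $\hat{s}_j$ and $s$ as above. Toward this end, note that

$\sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\sum_{i=1}^{\log_2(u_j)} 6e^{-\hat{s}_j(2^{i})}$

$\le \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\sum_{i=1}^{\log_2(u_j)} \frac{\delta}{2\left(\log_2(4u_j) - i\right)^2\left(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j\right)^2}$

$= \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\sum_{i=1}^{\log_2(u_j)} \frac{\delta}{2(i+1)^2\left(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j\right)^2}$

$< \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \frac{\delta}{2\left(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j\right)^2} < \sum_{i=1}^{\infty} \frac{\delta}{2(i+1)^2} < \delta/2$.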






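Noting that $2^{-s} = \delta/2$, we find that indeed

$1 - 2^{-s} - \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\sum_{i=1}^{\log_2(u_j)} 6e^{-\hat{s}_j(2^{i})} \ge 1 - \delta$.

Thus, by taking $\hat{s}$ to be the function satisfying (12) such that $\hat{s}(2^{-j}, \cdot) = \hat{s}_j(\cdot)$ for
all $j \in \mathbb{Z}$, Theorem 7 now implies the stated result. □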
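
As a sanity check on the failure-probability budget, the following sketch sums $6e^{-\hat{s}_j(2^{i})}$ over the same index ranges and confirms the total stays below $\delta/2$. It is an illustration only: it assumes Log(x) = max{ln(x), 1}, takes the functions $\hat{s}_j$ to have the form $\mathrm{Log}(12\log_2(4u_j/m)^2(\lfloor\log_2(8/\Psi_\ell(\varepsilon))\rfloor - j)^2/\delta)$, and uses a hypothetical callable `log2_u` in place of the actual values $\log_2(u_j)$.

```python
import math

def Log(x):
    # Truncated logarithm; assumed here to be max{ln(x), 1}.
    return max(math.log(x), 1.0)

def failure_budget(psi_eps, ell_bar, delta, log2_u):
    # log2_u(j) returns log2(u_j), assumed to be a positive integer.
    J = math.floor(math.log2(2.0 / psi_eps))
    K = math.floor(math.log2(8.0 / psi_eps))
    j_min = -math.ceil(math.log2(ell_bar))
    total = 0.0
    for j in range(j_min, J + 1):
        L = log2_u(j)
        for i in range(1, L + 1):
            # With m = 2^i, log2(4 u_j / m) = L + 2 - i.
            s_hat = Log(12.0 * (L + 2 - i) ** 2 * (K - j) ** 2 / delta)
            total += 6.0 * math.exp(-s_hat)
    return total

# Hypothetical inputs, chosen only for illustration.
delta = 0.1
budget = failure_budget(0.05, 2.0, delta, lambda j: 10 + abs(j))
assert budget < delta / 2
# Combined with 2^(-s) = delta / 2, the success probability is at least 1 - delta.
assert 1.0 - delta / 2 - budget >= 1.0 - delta
```

The bound does not depend on how large the $u_j$ are, since the inner sum is controlled by the convergent series $\sum_{i \ge 1} (i+1)^{-2}$.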

Proof Sketch of Theorem 17. The proof follows analogously to that of
Theorem 16, with the exception that now, for each integer $j$ with $-\lceil\log_2(\bar{\ell})\rceil \le j \le \lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor$,
we replace the definition of $u'_j$ from (64) with the following

definition. Letting Cj = vc(£j-)Log (a92^~ 1 (2 1 ^) a )^ , define 

$u'_j = c'\left(b2^{j(2-\beta)}\left(a\theta\Psi_\ell^{-1}(2^{1-j})^{\alpha}\right)^{1-\beta} + \bar{\ell}2^{j}\right)\left(c_j + s_j\right)$,

where $c' \in [1,\infty)$ is an appropriate universal constant, and $s_j$ is as in the proof
of Theorem 16. With this substitution in place, the values $u_j$ and $s$, and functions
$\hat{s}_j$ and $\hat{s}$, are then defined as in the proof of Theorem 16. By (35), (9), (8), and
Lemma 12, we can choose the constant $c'$ so that these $u_j$ satisfy (15). By an
identical argument to that used in Theorem 16, we have

$1 - 2^{-s} - \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}\sum_{i=1}^{\log_2(u_j)} 6e^{-\hat{s}_j(2^{i})} \ge 1 - \delta$.

It remains only to show that any values of $u$ and $n$ satisfying (36) and (37), respectively,
necessarily also satisfy the respective conditions for $u$ and $n$ in Corollary 9.

Toward this end, note that since $x \mapsto x\Psi_\ell^{-1}(1/x)$ is nondecreasing on $(0,\infty)$,
we have that

Llog 2 (2/r,(e))J Llog 2 (2/*i(e))J 

i=-riog 2 Mi i=-rio g2 wi 

~ 6 fey + ' M> + ¥ , m 2 ' Log (Llog2(8/ *' (E,)J -» 

where this last inequality is due to (66). Thus, for an appropriate choice of $c$, any $u$
satisfying (36) has $u \ge \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} u_j$, as required by Corollary 9.






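Finally, note that for $U_j$ as in Theorem 7, and $i_j = \lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor - j$,

$\sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \mathcal{P}(U_j)u_j \le \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} a\theta\Psi_\ell^{-1}(2^{1-j})^{\alpha}u_j$

$\lesssim \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} b\left(a\theta2^{j}\Psi_\ell^{-1}(2^{1-j})^{\alpha}\right)^{2-\beta}\left(A_2 + \mathrm{Log}(i_j + 2)\right)$

$+ \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \bar{\ell}a\theta2^{j}\Psi_\ell^{-1}(2^{1-j})^{\alpha}\left(A_2 + \mathrm{Log}(i_j + 2)\right)$.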

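By changing the order of summation, now summing over values of $i_j$ from $0$
to $\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor + \lceil\log_2(\bar{\ell})\rceil$, letting $N = \lceil\log_2(2\bar{\ell}/\Psi_\ell(\varepsilon))\rceil$, and noting
$2^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \le 2/\Psi_\ell(\varepsilon)$, and $\Psi_\ell^{-1}\left(2^{-\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor}2^{1+i}\right) \le 2^{1+i}\varepsilon$ for $i \ge 0$,
this last expression is

(70)  $\lesssim \sum_{i=0}^{N} b\left(\frac{a\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}\right)^{2-\beta}2^{i(\alpha-1)(2-\beta)}\left(A_2 + \mathrm{Log}(i+2)\right)$

$+ \sum_{i=0}^{N} \bar{\ell}\frac{a\theta\varepsilon^{\alpha}}{\Psi_\ell(\varepsilon)}2^{i(\alpha-1)}\left(A_2 + \mathrm{Log}(i+2)\right)$.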

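Considering these sums separately, we have $\sum_{i=0}^{N} 2^{i(\alpha-1)(2-\beta)}(A_2 + \mathrm{Log}(i+2)) \le (N+1)(A_2 + \mathrm{Log}(N+2))$
and $\sum_{i=0}^{N} 2^{i(\alpha-1)}(A_2 + \mathrm{Log}(i+2)) \le (N+1)(A_2 + \mathrm{Log}(N+2))$.
When $\alpha < 1$, we also have
$\sum_{i=0}^{N} 2^{i(\alpha-1)(2-\beta)}(A_2 + \mathrm{Log}(i+2)) \le \sum_{i=0}^{\infty} 2^{i(\alpha-1)(2-\beta)}(A_2 + \mathrm{Log}(i+2)) \le \frac{2}{1-2^{(\alpha-1)(2-\beta)}}\mathrm{Log}\left(\frac{1}{1-2^{(\alpha-1)(2-\beta)}}\right) + \frac{1}{1-2^{(\alpha-1)(2-\beta)}}A_2$,
and similarly $\sum_{i=0}^{N} 2^{i(\alpha-1)}(A_2 + \mathrm{Log}(i+2)) \le \frac{1}{1-2^{(\alpha-1)}}A_2 + \frac{2}{1-2^{(\alpha-1)}}\mathrm{Log}\left(\frac{1}{1-2^{(\alpha-1)}}\right)$.
Thus, generally $\sum_{i=0}^{N} 2^{i(\alpha-1)(2-\beta)}(A_2 + \mathrm{Log}(i+2)) \lesssim B_2(A_2 + \mathrm{Log}(B_2))$
and $\sum_{i=0}^{N} 2^{i(\alpha-1)}(A_2 + \mathrm{Log}(i+2)) \lesssim C_2(A_2 + \mathrm{Log}(C_2))$.
Plugging this into (70), we find that for an appropriately large numerical constant
$c$, any $n$ satisfying (37) has $n \ge \sum_{j=-\lceil\log_2(\bar{\ell})\rceil}^{\lfloor\log_2(2/\Psi_\ell(\varepsilon))\rfloor} \mathcal{P}(U_j)u_j$, as required by Corollary 9. □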

REFERENCES 

[1] K. S. Alexander. Rates of growth and sample moduli for weighted empirical processes indexed 
by sets. Probability Theory and Related Fields, 75:379-423, 1987. 5.2 

[2] J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of 
Statistics, 35(2):608-633, 2007. 1.1, 2.2 




[3] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of
the 23rd International Conference on Machine Learning, 2006. 4

[4] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer
and System Sciences, 75(1):78-89, 2009. 4

[5] M.-F. Balcan, S. Hanneke, and J. W. Vaughan. The true sample complexity of active learning.
Machine Learning, 80(2-3):111-139, 2010. 5.2, 5.2

[6] P. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal
of the American Statistical Association, 101:138-156, 2006. 1, 1.1, 2.1, 2.1, 2.1, 2.2, 2.2, 2.2,
2.2, 2.2, 2.4, 3.2, 4, 5.1, 5.1, 5.1, 5.1, 5.1, 5.5

[7] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of
Statistics, 33(4):1497-1537, 2005. 1, 2.4

[8] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In
Proceedings of the 26th International Conference on Machine Learning, 2009. 1.1, 3.2, 4, 5.2

[9] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on
Information Theory, 54(5):2339-2353, July 2008. 5.4

[10] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995. 2.2

[11] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In
Advances in Neural Information Processing Systems, 2007. 4, 5.2

[12] R. M. Dudley. Universal Donsker classes and metric entropy. The Annals of Probability,
15(4):1306-1326, 1987. 5.4

[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997. 2.2

[14] E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Conference on
Learning Theory, 2009. 5.2, 5.2

[15] E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type
empirical processes. The Annals of Probability, 34(3):1143-1216, 2006. 2.4, 2.4, 3.1, 4, 5.2,
5.3, 5.4, 5.4, 5.4, 5.4, 5.6

[16] E. Giné, V. Koltchinskii, and J. Wellner. Ratio limit theorems for empirical processes. In
Stochastic Inequalities, pages 249-278. Birkhäuser, 2003. 5.3

[17] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of
the 24th International Conference on Machine Learning, 2007. 5.2, 5.2

[18] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning
Department, School of Computer Science, Carnegie Mellon University, 2009. 1, 5.2

[19] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333-361,
2011. 1, 5.1, 5.2, 5.2, 5.4, 5.4

[20] S. Hanneke. Activized learning: Transforming passive to active with improved label
complexity. Journal of Machine Learning Research, 13:1469-1587, 2012. 4, 4, 5.1, 5.2, 5.2

[21] S. Hanneke and L. Yang. Negative results for active learning with convex losses. In Proceedings
of the 13th International Conference on Artificial Intelligence and Statistics, 2010. 3.2

[22] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Information and Computation, 100:78-150, 1992. 5.4

[23] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization.
The Annals of Statistics, 34(6):2593-2656, 2006. 1, 2.4, 2.4, 3.1, 4, 5.1, 5.1

[24] V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery
problems: Lecture notes. Technical report, École d'été de Probabilités de Saint-Flour, 2008. A

[25] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning.
Journal of Machine Learning Research, 11:2457-2485, 2010. 1, 1.1, 3.2, 4, 5.2

[26] S. Mahalanabis. A note on active learning for smooth problems. arXiv:1103.3095, 2011. 5.2,
5.2

[27] E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics,
27:1808-1829, 1999. 2.4, 5.1, 5.1

[28] S. Minsker. Plug-in approach to active learning. Journal of Machine Learning Research,
13(1):67-90, 2012. 1.1

[29] D. Nolan and D. Pollard. U-processes: Rates of convergence. The Annals of Statistics, 15(2): 
780-799, 1987. 5.4 

[30] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, Berlin / New York, 1984.
5.4

[31] D. Pollard. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference 
Series in Probability and Statistics, Vol. 2, Institute of Mathematical Statistics and American 
Statistical Association, 1990. 5.4 

[32] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In Advances in 
Neural Information Processing Systems 24, 2011. 5.2, 5.2, 5.4, 5.4 

[33] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statis- 
tics, 32(1):135-166, 2004. 2.4, 5.1, 5.1 

[34] A. van der Vaart and J. A. Wellner. A local maximal inequality under uniform entropy.
Electronic Journal of Statistics, 5:192-203, 2011. 2, 5.3, 5.3

[35] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer,
1996. 2.4, 5.3, 5.3, 5.4, 5.5, A

[36] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events 
to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971. 5.4 

[37] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active 
learning. Journal of Machine Learning Research, 12:2269-2292, 2011. 5.2, 5.2 

[38] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk 
minimization. The Annals of Statistics, 32(1):56-134, 2004. 1.1 



Department of Statistics 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsburgh, PA 15213 USA 
E-MAIL: shanneke@stat.cmu.edu 



Machine Learning Department 
Carnegie Mellon University 
5000 Forbes Avenue 
Pittsburgh, PA 15213 USA 
E-MAIL: liuy@cs.cmu.edu 



